Preface


Today, images and video are everywhere. Online photo-sharing sites and social networks
have them in the billions. Search engines will produce images of just about any
conceivable query. Practically all phones and computers come with built-in cameras.
It is not uncommon for people to have many gigabytes of photos and videos on their
devices.

Programming a computer and designing algorithms for understanding what is in these 
images is the field of computer vision. Computer vision powers applications like image 
search, robot navigation, medical image analysis, photo management, and many more. 

The idea behind this book is to give an easily accessible entry point to hands-on 
computer vision with enough understanding of the underlying theory and algorithms 
to be a foundation for students, researchers, and enthusiasts. The Python programming 
language, the language choice of this book, comes with many freely available, powerful 
modules for handling images, mathematical computing, and data mining. 

When writing this book, I have used the following principles as a guideline. The book
should:

• Be written in an exploratory style and encourage readers to follow the examples on 
their computers as they are reading the text. 

• Promote and use free and open software with a low learning threshold. Python was
the obvious choice.

• Be complete and self-contained. This book does not cover all of computer vision,
but it should be complete in that all code is presented and explained. The
reader should be able to reproduce the examples and build upon them directly.

• Be broad rather than detailed, inspiring and motivational rather than theoretical. 

In short, it should act as a source of inspiration for those interested in programming 
computer vision applications. 
Prerequisites and Overview

This book looks at theory and algorithms for a wide range of applications and problems. 
Here is a short summary of what to expect. 


What You Need to Know

• Basic programming experience. You need to know how to use an editor and run
scripts, how to structure code as well as basic data types. Familiarity with Python
or other scripting languages like Ruby or Matlab will help.

• Basic mathematics. To make full use of the examples, it helps if you know about
matrices, vectors, matrix multiplication, and standard mathematical functions and
concepts like derivatives and gradients. Some of the more advanced mathematical
examples can be easily skipped.


What You Will Learn

• Hands-on programming with images using Python. 

• Computer vision techniques behind a wide variety of real-world applications. 

• Many of the fundamental algorithms and how to implement and apply them
yourself.

The code examples in this book will show you object recognition, content-based 
image retrieval, image search, optical character recognition, optical flow, tracking, 3D 
reconstruction, stereo imaging, augmented reality pose estimation, panorama creation, 
image segmentation, de-noising, image grouping, and more. 

Chapter Overview

Chapter 1, “Basic Image Handling and Processing” 

Introduces the basic tools for working with images and the central Python modules
used in the book. This chapter also covers many fundamental examples needed for
the remaining chapters.

Chapter 2, “Local Image Descriptors” 

Explains methods for detecting interest points in images and how to use them to 
find corresponding points and regions between images. 

Chapter 3, “Image to Image Mappings” 

Describes basic transformations between images and methods for computing them. 
Examples range from image warping to creating panoramas. 

Chapter 4, “Camera Models and Augmented Reality” 

Introduces how to model cameras, generate image projections from 3D space to 
image features, and estimate the camera viewpoint. 

Chapter 5, “Multiple View Geometry” 

Explains how to work with several images of the same scene, the fundamentals of 
multiple-view geometry, and how to compute 3D reconstructions from images. 
Chapter 6, “Clustering Images”

Introduces a number of clustering methods and shows how to use them for grouping
and organizing images based on similarity or content.

Chapter 7, “Searching Images” 

Shows how to build efficient image retrieval techniques that can store image
representations and search for images based on their visual content.

Chapter 8, “Classifying Image Content” 

Describes algorithms for classifying image content and how to use them to recognize
objects in images.

Chapter 9, “Image Segmentation” 

Introduces different techniques for dividing an image into meaningful regions 
using clustering, user interactions, or image models. 

Chapter 10, “OpenCV” 

Shows how to use the Python interface for the commonly used OpenCV computer
vision library and how to work with video and camera input.

There is also a bibliography at the back of the book. Citations of bibliographic entries 
are made by number in square brackets, as in [20]. 

Introduction to Computer Vision 

Computer vision is the automated extraction of information from images. Information
can mean anything from 3D models, camera position, object detection and recognition
to grouping and searching image content. In this book, we take a wide definition of
computer vision and include things like image warping, de-noising, and augmented
reality.*

Sometimes computer vision tries to mimic human vision, sometimes it uses a data and
statistical approach, and sometimes geometry is the key to solving problems. We will
try to cover all of these angles in this book.

Practical computer vision contains a mix of programming, modeling, and mathematics
and is sometimes difficult to grasp. I have deliberately tried to present the material
with a minimum of theory in the spirit of “as simple as possible but no simpler.”
The mathematical parts of the presentation are there to help readers understand the
algorithms. Some chapters are by nature very math-heavy (Chapters 4 and 5, mainly).
Readers can skip the math if they like and still use the example code.

* These examples produce new images and are more image processing than actually extracting information from
images.

Python and NumPy

Python is the programming language used in the code examples throughout this book.
Python is a clear and concise language with good support for input/output, numerics,
images, and plotting. The language has some peculiarities, such as indentation
and compact syntax, that take getting used to. The code examples assume you have
Python 2.6 or later, as most packages are only available for these versions. The upcoming
Python 3.x version has many language differences and is not backward compatible
with Python 2.x or compatible with the ecosystem of packages we need (yet).

Some familiarity with basic Python will make the material more accessible for readers.
For beginners to Python, Mark Lutz's book Learning Python [20] and the online
documentation at http://www.python.org/ are good starting points.

When programming computer vision, we need representations of vectors and matrices
and operations on them. This is handled by Python's NumPy module, where both vectors
and matrices are represented by the array type. This is also the representation we will
use for images. A good NumPy reference is Travis Oliphant's free book Guide to NumPy
[24]. The documentation at http://numpy.scipy.org/ is also a good starting point if you
are new to NumPy. For visualizing results, we will use the Matplotlib module, and for
more advanced mathematics, we will use SciPy. These are the central packages you will
need, and they will be explained and introduced in Chapter 1.

Besides these central packages, there will be many other free Python packages used
for specific purposes like reading JSON or XML, loading and saving data, generating
graphs, graphics programming, web demos, classifiers, and many more. These are
usually only needed for specific applications or demos and can be skipped if you are
not interested in that particular application.

It is worth mentioning IPython, an interactive Python shell that makes debugging
and experimentation easier. Documentation and downloads are available at
http://ipython.org/.

Notation and Conventions 

Code looks like this: 

# some points
x = [100,100,400,400]
y = [200,500,200,500]

# plot the points
plot(x,y)

The following typographical conventions are used in this book: 

Italic 

Used for definitions, filenames, and variable names. 

Constant width 

Used for functions, Python modules, and code examples. It is also used for console 
printouts. 

Hyperlink 

Used for URLs. 

Plain text 

Used for everything else. 
Mathematical formulas are given inline like this $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$ or centered
independently:

$$f(\mathbf{x}) = \sum_i w_i x_i + b$$

and are only numbered when a reference is needed.

In the mathematical sections, we will use lowercase ($s, r, \lambda, \theta, \ldots$) for scalars,
uppercase ($A, V, H, \ldots$) for matrices (including $I$ for the image as an array), and lowercase
bold ($\mathbf{t}, \mathbf{c}, \ldots$) for vectors. We will use $\mathbf{x} = [x, y]$ and $\mathbf{X} = [X, Y, Z]$ to mean points in
2D (images) and 3D, respectively.

Using Code Examples 

This book is here to help you get your job done. In general, you may use the code in
this book in your programs and documentation. You do not need to contact us for
permission unless you're reproducing a significant portion of the code. For example,
writing a program that uses several chunks of code from this book does not require
permission. Selling or distributing a CD-ROM of examples from O'Reilly books does
require permission. Answering a question by citing this book and quoting example
code does not require permission. Incorporating a significant amount of example code
from this book into your product's documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title,
author, publisher, and ISBN. For example: “Programming Computer Vision with Python
by Jan Erik Solem (O'Reilly). Copyright © 2012 Jan Erik Solem, 978-1-449-31654-9.”

If you feel your use of code examples falls outside fair use or the permission given above,
feel free to contact us at permissions@oreilly.com.


How to Contact Us 

Please address comments and questions concerning this book to the publisher: 

O'Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472

(800) 998-9938 (in the United States or Canada) 

(707) 829-0515 (international or local) 

(707) 829-0104 (fax) 

We have a web page for this book, where we list errata, examples, links to the code and 
data sets used, and any additional information. You can access this page at: 

oreil.ly/comp_vision_w_python 

To comment or ask technical questions about this book, send email to: 
bookquestions@oreilly.com

For more information about our books, courses, conferences, and news, see our website
at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly 

Follow us on Twitter: http://twitter.com/oreillymedia 

Watch us on YouTube: http://www.youtube.com/oreillymedia 
CHAPTER 1

Basic Image Handling
and Processing

This chapter is an introduction to handling and processing images. With extensive
examples, it explains the central Python packages you will need for working with
images. This chapter introduces the basic tools for reading images, converting and
scaling images, computing derivatives, plotting or saving results, and so on. We will
use these throughout the remainder of the book.

1.1 PIL—The Python Imaging Library 

The Python Imaging Library (PIL) provides general image handling and lots of useful
basic image operations like resizing, cropping, rotating, color conversion and much
more. PIL is free and available from http://www.pythonware.com/products/pil/.

With PIL, you can read images from most formats and write to the most common ones. 
The most important module is the Image module. To read an image, use: 

from PIL import Image 

pil_im = Image.open('empire.jpg')

The return value, pil_im, is a PIL image object. 

Color conversions are done using the convert() method. To read an image and convert
it to grayscale, just add convert('L') like this:

pil_im = Image.open('empire.jpg').convert('L') 

Here are some examples taken from the PIL documentation, available at
http://www.pythonware.com/library/pil/handbook/index.htm. Output from the examples is shown
in Figure 1-1.

Convert Images to Another Format 

Using the save() method, PIL can save images in most image file formats. Here's an
example that takes all image files in a list of filenames (filelist) and converts the images
to JPEG files:






Figure 1-1. Examples of processing images with PIL. 


from PIL import Image
import os

for infile in filelist:
    outfile = os.path.splitext(infile)[0] + ".jpg"
    if infile != outfile:
        try:
            Image.open(infile).save(outfile)
        except IOError:
            print "cannot convert", infile

The PIL function open() creates a PIL image object and the save() method saves the 
image to a file with the given filename. The new filename will be the same as the original 
with the file ending “.jpg” instead. PIL is smart enough to determine the image format 
from the file extension. There is a simple check that the file is not already a JPEG file 
and a message is printed to the console if the conversion fails. 
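
The filename handling in this loop can be tried without any image files. The following is a
minimal sketch, not part of PIL: the helper name jpeg_name and the example filenames are
made up for illustration. It shows how os.path.splitext() separates the extension and how
the check skips files that already end in ".jpg":

```python
import os

def jpeg_name(infile):
    """Return infile with its extension replaced by '.jpg' (hypothetical helper)."""
    root, ext = os.path.splitext(infile)  # 'photo.png' -> ('photo', '.png')
    return root + ".jpg"

# made-up filenames, no files are read or written here
names = ["photo.png", "scan.tiff", "already.jpg"]

# keep only the files whose name would actually change
to_convert = [(f, jpeg_name(f)) for f in names if f != jpeg_name(f)]
print(to_convert)
```

Only the last extension is replaced, so a name like "a/b.c.png" becomes "a/b.c.jpg".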

Throughout this book we are going to need lists of images to process. Here's how you
could create a list of filenames of all images in a folder. Create a file called imtools.py to
store some of these generally useful routines and add the following function:

import os

def get_imlist(path):
    """ Returns a list of filenames for
        all jpg images in a directory. """

    return [os.path.join(path,f) for f in os.listdir(path) if f.endswith('.jpg')]
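
To see what this function returns, it can be run on a throwaway directory. This is only a
sketch: the directory is created with the standard tempfile module and the filenames are
made up. Note that the endswith('.jpg') test is case-sensitive, so a file named "c.JPG" is
not listed:

```python
import os
import tempfile

def get_imlist(path):
    """Returns a list of filenames for all jpg images in a directory."""
    return [os.path.join(path, f) for f in os.listdir(path) if f.endswith('.jpg')]

# create a temporary directory with a few empty files (made-up names)
tmpdir = tempfile.mkdtemp()
for name in ['a.jpg', 'b.jpg', 'notes.txt', 'c.JPG']:
    open(os.path.join(tmpdir, name), 'w').close()

# os.listdir() returns names in arbitrary order, so sort for a stable result
imlist = sorted(get_imlist(tmpdir))
basenames = [os.path.basename(f) for f in imlist]
print(basenames)
```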

Now, back to PIL. 


Create Thumbnails

Using PIL to create thumbnails is very simple. The thumbnail() method takes a tuple
specifying the new size and converts the image to a thumbnail image with size that fits
within the tuple. To create a thumbnail with longest side 128 pixels, use the method
like this:

pil_im.thumbnail((128,128))
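
The phrase "fits within the tuple" means the aspect ratio is kept and the image is only
ever made smaller. The arithmetic can be sketched in plain Python as follows. The function
thumbnail_size is a hypothetical helper written for this illustration, not PIL's
implementation, and PIL's own rounding may differ by a pixel:

```python
def thumbnail_size(size, max_size):
    """Size (width, height) that fits inside max_size with the same aspect ratio.

    Hypothetical helper for illustration only, not part of PIL.
    """
    w, h = size
    max_w, max_h = max_size
    # one common scale factor for both sides, never larger than 1 (no enlarging)
    scale = min(float(max_w) / w, float(max_h) / h, 1.0)
    return (max(1, int(round(w * scale))), max(1, int(round(h * scale))))

# an image 569 pixels wide and 800 pixels high
print(thumbnail_size((569, 800), (128, 128)))
# an image that is already small enough is left alone
print(thumbnail_size((100, 50), (128, 128)))
```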

Copy and Paste Regions

Cropping a region from an image is done using the crop() method: 

box = (100,100,400,400) 
region = pil_im.crop(box) 

The region is defined by a 4-tuple, where coordinates are (left, upper, right, lower). PIL 
uses a coordinate system with (0, 0) in the upper left corner. The extracted region can, 
for example, be rotated and then put back using the paste() method like this: 

region = region.transpose(Image.ROTATE_180)
pil_im.paste(region,box) 
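
Because the box is given as (left, upper, right, lower), the size of the cropped region is
the difference between opposite edges. A small sketch, where box_size is a made-up helper
name, shows the arithmetic. A 180 degree rotation keeps this size, which is why the
rotated region fits back into the same box:

```python
def box_size(box):
    """Width and height of a (left, upper, right, lower) box (hypothetical helper)."""
    left, upper, right, lower = box
    return (right - left, lower - upper)

box = (100, 100, 400, 400)
print(box_size(box))
```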


Resize and Rotate 

To resize an image, call resize() with a tuple giving the new size:

out = pil_im.resize((128,128))

To rotate an image, use counterclockwise angles and rotate() like this:

out = pil_im.rotate(45)

Some examples are shown in Figure 1-1. The leftmost image is the original, followed
by a grayscale version, a rotated crop pasted in, and a thumbnail image.

1.2 Matplotlib

When working with mathematics and plotting graphs or drawing points, lines, and
curves on images, Matplotlib is a good graphics library with much more powerful
features than the plotting available in PIL. Matplotlib produces high-quality figures
like many of the illustrations used in this book. Matplotlib's PyLab interface is the
set of functions that allows the user to create plots. Matplotlib is open source and
available freely from http://matplotlib.sourceforge.net/, where detailed documentation
and tutorials are available. Here are some examples showing most of the functions we
will need in this book.

Plotting Images, Points, and Lines 

Although it is possible to create nice bar plots, pie charts, scatter plots, etc., only a few 
commands are needed for most computer vision purposes. Most importantly we want 
to be able to show things like interest points, correspondences, and detected objects 
using points and lines. Here is an example of plotting an image with a few points and 
a line: 
from PIL import Image
from pylab import *

# read image to array
im = array(Image.open('empire.jpg'))

# plot the image
imshow(im)

# some points
x = [100,100,400,400]
y = [200,500,200,500]

# plot the points with red star-markers
plot(x,y,'r*')

# line plot connecting the first two points
plot(x[:2],y[:2])

# add title and show the plot
title('Plotting: "empire.jpg"')
show()

This plots the image, then four points with red star markers at the x and y coordinates
given by the x and y lists, and finally draws a line (blue by default) between the first
two points in these lists. Figure 1-2 shows the result. The show() command starts the
figure GUI and raises the figure windows. This GUI loop blocks your script, which is
paused until the last figure window is closed. You should call show() only once per
script, usually at the end. Note that PyLab uses a coordinate origin at the top left corner
as is common for images. The axes are useful for debugging, but if you want a prettier
plot, add:

axis('off') 

This will give a plot like the one on the right in Figure 1-2 instead. 

There are many options for formatting color and styles when plotting. The most useful 
are the short commands shown in Tables 1-1, 1-2 and 1-3. Use them like this: 


plot(x,y)        # default blue solid line
plot(x,y,'r*')   # red star-markers
plot(x,y,'go-')  # green line with circle-markers
plot(x,y,'ks:')  # black dotted line with square-markers


Figure 1-2. Examples of plotting with Matplotlib. An image with points and a line with and without
showing the axes.

Table 1-1. Basic color formatting commands for plotting with PyLab.

Color
'b'     blue
'g'     green
'r'     red
'c'     cyan
'm'     magenta
'y'     yellow
'k'     black
'w'     white

Table 1-2. Basic line style formatting commands for plotting with PyLab.

Line style
'-'     solid
'--'    dashed
':'     dotted

Table 1-3. Basic plot marker formatting commands for plotting with PyLab.

Marker
'.'     point
'o'     circle
's'     square
'*'     star
'+'     plus
'x'     x

Image Contours and Histograms

Let's look at two examples of special plots: image contours and image histograms.
Visualizing image iso-contours (or iso-contours of other 2D functions) can be very
useful. This needs grayscale images, because the contours need to be taken on a single
value for every coordinate [x, y]. Here's how to do it:

from PIL import Image
from pylab import *

# read image to array
im = array(Image.open('empire.jpg').convert('L'))

# create a new figure
figure()

# don't use colors
gray()

# show contours with origin upper left corner
contour(im, origin='image')

axis('equal')
axis('off')

As before, the PIL method convert() does conversion to grayscale. 

An image histogram is a plot showing the distribution of pixel values. A number of
bins is specified for the span of values and each bin gets a count of how many pixels
have values in the bin's range. The visualization of the (graylevel) image histogram is
done using the hist() function:

figure()
hist(im.flatten(),128)
show()

The second argument specifies the number of bins to use. Note that the image needs to
be flattened first, because hist() takes a one-dimensional array as input. The method
flatten() converts any array to a one-dimensional array with values taken row-wise.
Figure 1-3 shows the contour and histogram plot.
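
The flattening and binning can also be inspected without plotting. The sketch below uses
NumPy's histogram() function, which computes the same kind of bin counts, on a tiny
made-up array instead of a real image. The array values, the four bins, and the range
0 ... 256 are assumptions chosen to keep the numbers easy to follow:

```python
import numpy as np

# a tiny made-up "image" with graylevels in 0...255
im = np.array([[0, 10, 200],
               [255, 10, 130]], dtype=np.uint8)

# row-wise flattening: first row, then second row
flat = im.flatten()

# four equally wide bins covering 0...256
counts, edges = np.histogram(flat, bins=4, range=(0, 256))

print(flat)
print(counts)  # number of pixels in each bin
print(edges)   # bin boundaries
```

Each count is the number of pixels whose value falls between two neighboring edges, which
is what the bars in a histogram plot show.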



Figure 1-3. Examples of visualizing image contours and plotting image histograms with Matplotlib. 
Interactive Annotation


Sometimes users need to interact with an application, for example by marking points 
in an image, or you need to annotate some training data. PyLab comes with a simple 
function, ginput(), that lets you do just that. Here's a short example:

from PIL import Image
from pylab import *

im = array(Image.open('empire.jpg'))
imshow(im)

print 'Please click 3 points'
x = ginput(3)
print 'you clicked:',x
show()

This plots an image and waits for the user to click three times in the image region of 
the figure window. The coordinates [x, y] of the clicks are saved in a list x. 

1.3 NumPy 

NumPy (http://www.scipy.org/NumPy/) is a package popularly used for scientific computing
with Python. NumPy contains a number of useful concepts such as array objects (for
representing vectors, matrices, images and much more) and linear algebra functions.
The NumPy array object will be used in almost all examples throughout this book.* The
array object lets you do important operations such as matrix multiplication, transposition,
solving equation systems, vector multiplication, and normalization, which are
needed to do things like aligning images, warping images, modeling variations, classifying
images, grouping images, and so on.

NumPy is freely available from http://www.scipy.org/Download and the online documentation
(http://docs.scipy.org/doc/numpy/) contains answers to most questions. For more
details on NumPy, the freely available book [24] is a good reference.

* PyLab actually includes some components of NumPy, like the array type. That's why we could use it in the
examples in Section 1.2.

Array Image Representation

When we loaded images in the previous examples, we converted them to NumPy array
objects with the array() call but didn't mention what that means. Arrays in NumPy are
multi-dimensional and can represent vectors, matrices, and images. An array is much
like a list (or list of lists) but is restricted to having all elements of the same type. Unless
specified on creation, the type will automatically be set depending on the data.

The following example illustrates this for images:

im = array(Image.open('empire.jpg'))
print im.shape, im.dtype

im = array(Image.open('empire.jpg').convert('L'),'f')
print im.shape, im.dtype

The printout in your console will look like this:

(800, 569, 3) uint8
(800, 569) float32

The first tuple on each line is the shape of the image array (rows, columns, color
channels), and the following string is the data type of the array elements. Images
are usually encoded with unsigned 8-bit integers (uint8), so loading this image and
converting to an array gives the type “uint8” in the first case. The second case does
grayscale conversion and creates the array with the extra argument “f”. This is a short
command for setting the type to floating point. For more data type options, see [24].
Note that the grayscale image has only two values in the shape tuple; obviously it has
no color information.
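
The same shapes and types can be reproduced without an image file. This sketch builds
empty arrays of the sizes quoted above (the arrays are stand-ins, not real image data) and
also shows the type being chosen automatically from the data when none is given:

```python
import numpy as np

# stand-ins for a color image and a grayscale float image of the same size
color = np.zeros((800, 569, 3), dtype=np.uint8)
gray = np.zeros((800, 569), dtype='f')  # 'f' is shorthand for 32-bit float

print(color.shape, color.dtype)
print(gray.shape, gray.dtype)

# without an explicit type, NumPy picks one that can hold all the values
ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5])
print(ints.dtype.kind, mixed.dtype.kind)  # 'i' for integer, 'f' for float
```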

Elements in the array are accessed with indexes. The value at coordinates i, j and color
channel k is accessed like this:


value = im[i,j,k] 

Multiple elements can be accessed using array slicing. Slicing returns a view into the 
array specified by intervals. Here are some examples for a grayscale image: 


im[i,:] = im[j,:]     # set the values of row i with values from row j
im[:,i] = 100         # set all values in column i to 100
im[:100,:50].sum()    # the sum of the values of the first 100 rows and 50 columns
im[50:100,50:100]     # rows 50-100, columns 50-100 (100th not included)
im[i].mean()          # average of row i
im[:,-1]              # last column
im[-2,:] (or im[-2])  # second to last row


Note the example with only one index. If you only use one index, it is interpreted as the 
row index. Note also the last examples. Negative indices count from the last element 
backward. We will frequently use slicing to access pixel values, and it is an important 
concept to understand. 
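
Since slicing returns a view rather than a copy, assigning to a slice changes the original
array. A minimal sketch on a small made-up array (4 rows, 5 columns, values 0 ... 19)
makes this visible:

```python
import numpy as np

# small stand-in for a grayscale image: rows are [0..4], [5..9], [10..14], [15..19]
im = np.arange(20).reshape(4, 5)

last_col = im[:, -1]       # last column
second_to_last = im[-2]    # second to last row, same as im[-2,:]

block = im[1:3, 2:4]       # rows 1-2, columns 2-3 (upper limits not included)
saved = block.copy()       # an independent copy of the block
block[:] = 0               # writes through the view into im

print(im)
print(saved)
```

Use copy() when the original data should stay untouched.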

There are many operations and ways to use arrays. We will introduce them as they are 
needed throughout this book. See the online documentation or the book [24] for more 
explanations. 


Graylevel Transforms

After reading images to NumPy arrays, we can perform any mathematical operation we
like on them. A simple example of this is to transform the graylevels of an image. Take
any function f that maps the interval 0 . . . 255 (or, if you like, 0 . . . 1) to itself (meaning
that the output has the same range as the input). Here are some examples:

from PIL import Image
from numpy import *

im = array(Image.open('empire.jpg').convert('L'))
im2 = 255 - im # invert image
im3 = (100.0/255) * im + 100 # compress to interval 100...200
im4 = 255.0 * (im/255.0)**2 # squared

The first example inverts the graylevels of the image, the second one compresses the
intensities to the interval 100 . . . 200, and the third applies a quadratic function, which lowers
the values of the darker pixels. Figure 1-4 shows the functions and Figure 1-5 the resulting
images. You can check the minimum and maximum values of each image using:

print int(im.min()), int(im.max())


Figure 1-4. Example of graylevel transforms. Three example functions, f(x) = 255 - x,
f(x) = (100.0/255.0)*x + 100, and f(x) = 255.0*(x/255.0)^2, together with the identity
transform shown as a dashed line.

If you try that for each of the examples above, you should get the following output: 

2 255 
0 253 
100 200 
0 255 
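
The ranges of the three transforms can also be checked on a synthetic image that contains
every graylevel exactly once. This is a sketch under that assumption: the 16 x 16 ramp
below stands in for a real photograph, so its minimum and maximum are 0 and 255 rather
than the values printed above:

```python
import numpy as np

# synthetic 16x16 "image" holding the graylevels 0...255
im = np.arange(256, dtype=np.uint8).reshape(16, 16)

im2 = 255 - im                    # invert
im3 = (100.0 / 255) * im + 100    # linear map onto 100...200
im4 = 255.0 * (im / 255.0) ** 2   # quadratic, darkens the lower values

for name, a in [('im', im), ('im2', im2), ('im3', im3), ('im4', im4)]:
    print("%s: min %.1f max %.1f" % (name, a.min(), a.max()))

# the quadratic transform maps mid-gray 128 well below 128
print("%.2f" % im4[8, 0])
```

Formatting the values with "%.1f" instead of truncating with int() avoids being misled by
floating point results that land a hair below a whole number.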

The reverse of the array() transformation can be done using the PIL function
fromarray() as:

pil_im = Image.fromarray(im)

If you did some operation to change the type from “uint8” to another data type, such
as im3 or im4 in the example above, you need to convert back before creating the PIL
image:

pil_im = Image.fromarray(uint8(im))

If you are not absolutely sure of the type of the input, you should do this as it is the safe
choice. Note that NumPy will always change the array type to the “lowest” type that can
represent the data. Multiplication or division with floating point numbers will change
an integer type array to float.
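
The type change is easy to observe on a small array. The sketch below also clips the values
to 0 ... 255 before converting back. The clipping step is an addition for illustration and
is not part of the text above. It is a sensible precaution because values outside that
range do not survive a cast to uint8 intact:

```python
import numpy as np

im = np.array([[0, 100, 255]], dtype=np.uint8)

# multiplying by a floating point number gives a float array
scaled = im * 1.5
print(scaled.dtype, scaled)

# limit the values to the valid graylevel range, then convert back
back = np.uint8(np.clip(scaled, 0, 255))
print(back.dtype, back)
```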

Image Resizing 

NumPy arrays will be our main tool for working with images and data. There is no simple 
way to resize arrays, which you will want to do for images. We can use the PIL image 
object conversion shown earlier to make a simple image resizing function. Add the 
following to imtools.py. 

def imresize(im,sz):
    """ Resize an image array using PIL. """
    pil_im = Image.fromarray(uint8(im))

    return array(pil_im.resize(sz))

This function will come in handy later. 

Histogram Equalization 

A very useful example of a graylevel transform is histogram equalization. This transform 
flattens the graylevel histogram of an image so that all intensities are as equally common 
as possible. This is often a good way to normalize image intensity before further
processing and also a way to increase image contrast.

The transform function is, in this case, a cumulative distribution function (cdf) of the 
pixel values in the image (normalized to map the range of pixel values to the desired 
range). 

Here's how to do it. Add this function to the file imtools.py.

def histeq(im,nbr_bins=256):
    """ Histogram equalization of a grayscale image. """

    # get image histogram
    # (density=True is the current name of the older normed=True option)
    imhist,bins = histogram(im.flatten(),nbr_bins,density=True)
    cdf = imhist.cumsum() # cumulative distribution function
    cdf = 255 * cdf / cdf[-1] # normalize

    # use linear interpolation of cdf to find new pixel values
    im2 = interp(im.flatten(),bins[:-1],cdf)

    return im2.reshape(im.shape), cdf

The function takes a grayscale image and the number of bins to use in the histogram
as input, and returns an image with equalized histogram together with the cumulative
distribution function used to do the mapping of pixel values. Note the use of the last
element (index -1) of the cdf to normalize it to the range 0 ... 1 before scaling by 255.
Try this on an image like this:

from PIL import Image
from numpy import *
import imtools

im = array(Image.open('AquaTermi_lowcontrast.jpg').convert('L'))
im2,cdf = imtools.histeq(im)

Figures 1-6 and 1-7 show examples of histogram equalization. The top row shows the
graylevel histogram before and after equalization together with the cdf mapping. As you
can see, the contrast increases and the details of the dark regions now appear clearly.
Averaging Images 

Averaging images is a simple way of reducing image noise and is also often used for
artistic effects. Computing an average image from a list of images is not difficult.
Assuming the images all have the same size, we can compute the average of all those
images by simply summing them up and dividing by the number of images. Add the
following function to imtools.py.

def compute_average(imlist):
    """ Compute the average of a list of images. """

    # open first image and make into array of type float
    averageim = array(Image.open(imlist[0]), 'f')

    for imname in imlist[1:]:
        try:
            averageim += array(Image.open(imname))
        except:
            print imname + '...skipped'
    averageim /= len(imlist)

    # return average as uint8
    return array(averageim, 'uint8')

This includes some basic exception handling to skip images that can't be opened. There
is another way to compute average images using the mean() function. This requires all
images to be stacked into an array and will use lots of memory if there are many images.
We will use this function in the next section.
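
Both ways of averaging can be compared on synthetic data. In this sketch the "images" are
small arrays of a constant graylevel with random noise added. The image size, noise level,
number of images, and seed are all assumptions chosen for the illustration:

```python
import numpy as np

rng = np.random.RandomState(1)

# 50 noisy versions of a constant 4x4 image with graylevel 100
clean = np.full((4, 4), 100.0)
noisy = [clean + rng.normal(0, 10, clean.shape) for _ in range(50)]

# running sum, one image at a time (low memory use)
avg_sum = np.zeros_like(clean)
for im in noisy:
    avg_sum += im
avg_sum /= len(noisy)

# stack everything into one array and use mean() (simple, but uses more memory)
avg_mean = np.array(noisy).mean(axis=0)

err_single = np.abs(noisy[0] - clean).mean()
err_avg = np.abs(avg_sum - clean).mean()
print("mean error, single image: %.2f" % err_single)
print("mean error, average:      %.2f" % err_avg)
```

The two averages agree, and the averaged image is much closer to the noise-free image than
any single noisy one.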
Figure 1-6. Example of histogram equalization. On the left is the original image and histogram. The
middle plot is the graylevel transform function. On the right is the image and histogram after histogram
equalization.

Figure 1-7. Example of histogram equalization. On the left is the original image and histogram. The
middle plot is the graylevel transform function. On the right is the image and histogram after histogram
equalization.
PCA of Images

Principal Component Analysis (PCA) is a useful technique for dimensionality reduction 
and is optimal in the sense that it represents the variability of the training data with 
as few dimensions as possible. Even a tiny 100 x 100 pixel grayscale image has 10,000 
dimensions, and can be considered a point in a 10,000-dimensional space. A megapixel 
image has dimensions in the millions. With such high dimensionality, it is no surprise 
that dimensionality reduction comes in handy in many computer vision applications. 
The projection matrix resulting from PCA can be seen as a change of coordinates to a 
coordinate system where the coordinates are in descending order of importance. 

To apply PCA on image data, the images need to be converted to a one-dimensional
vector representation using, for example, NumPy's flatten() method.

The flattened images are collected in a single matrix by stacking them, one row for each
image. The rows are then centered relative to the mean image before the computation
of the dominant directions. To find the principal components, singular value decomposition
(SVD) is usually used, but if the dimensionality is high, there is a useful trick
that can be used instead since the SVD computation will be very slow in that case. Here
is what it looks like in code:

from PIL import Image
from numpy import *

def pca(X):
  """ Principal Component Analysis
    input: X, matrix with training data stored as flattened arrays in rows
    return: projection matrix (with important dimensions first), variance
    and mean."""

  # get dimensions
  num_data,dim = X.shape

  # center data
  mean_X = X.mean(axis=0)
  X = X - mean_X

  if dim>num_data:
    # PCA - compact trick used
    M = dot(X,X.T) # covariance matrix
    e,EV = linalg.eigh(M) # eigenvalues and eigenvectors
    tmp = dot(X.T,EV).T # this is the compact trick
    V = tmp[::-1] # reverse since last eigenvectors are the ones we want
    S = sqrt(e)[::-1] # reverse since eigenvalues are in increasing order
    for i in range(V.shape[1]):
      V[:,i] /= S
  else:
    # PCA - SVD used
    U,S,V = linalg.svd(X)
    V = V[:num_data] # only makes sense to return the first num_data

  # return the projection matrix, the variance and the mean
  return V,S,mean_X

This function first centers the data by subtracting the mean in each dimension. Then
the eigenvectors corresponding to the largest eigenvalues of the covariance matrix are
computed, either using a compact trick or using SVD. Here we used the function
range(), which takes an integer n and returns a list of integers 0 ... (n - 1). Feel free to
use the alternative arange(), which gives an array, or xrange(), which gives a generator
(and might give speed improvements). We will stick with range() throughout the book.

We switch from SVD to use a trick with computing eigenvectors of the (smaller)
covariance matrix $XX^T$ if the number of data points is less than the dimension of the
vectors. There are also ways of only computing the eigenvectors corresponding to the k
largest eigenvalues (k being the number of desired dimensions), making it even faster.
We leave this to the interested reader to explore, since it is really outside the scope of this
book. The rows of the matrix V are orthogonal and contain the coordinate directions
in order of descending variance of the training data.
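
To see why the compact trick works, note that if u is an eigenvector of the small matrix $XX^T$ with eigenvalue $\lambda$, then $X^T u$ is an eigenvector of the large matrix $X^T X$ with the same eigenvalue. The sketch below checks this numerically on random data; the sizes (5 points in 50 dimensions) and the random seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.standard_normal((5, 50))        # 5 data points, 50 dimensions
X = X - X.mean(axis=0)                  # center the data

# eigen-decomposition of the small 5 x 5 matrix
e, EV = np.linalg.eigh(np.dot(X, X.T))

# map the eigenvector with the largest eigenvalue back to 50 dimensions
v = np.dot(X.T, EV[:, -1])
v /= np.linalg.norm(v)

# v should satisfy (X^T X) v = lambda v with the same eigenvalue
C = np.dot(X.T, X)
residual = np.linalg.norm(np.dot(C, v) - e[-1] * v)
```

The residual should be zero up to rounding, even though the 50 x 50 matrix was never decomposed.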

Let's try this on an example of font images. The file fontimages.zip contains small
thumbnail images of the character “a” printed in different fonts and then scanned. The
2,359 fonts are from a collection of freely available fonts.^ Assuming that the filenames
of these images are stored in a list, imlist, along with the previous code, in a file pca.py,
the principal components can be computed and shown like this:

from PIL import Image
from numpy import *
from pylab import *
import pca

im = array(Image.open(imlist[0])) # open one image to get size
m,n = im.shape[0:2] # get the size of the images
imnbr = len(imlist) # get the number of images

# create matrix to store all flattened images
immatrix = array([array(Image.open(im)).flatten()
                  for im in imlist],'f')

# perform PCA
V,S,immean = pca.pca(immatrix)

# show some images (mean and 7 first modes)
figure()
gray()
subplot(2,4,1)
imshow(immean.reshape(m,n))
for i in range(7):
  subplot(2,4,i+2)
  imshow(V[i].reshape(m,n))

show()


^ Images courtesy of Martin Solli (http://webstaff.itn.liu.se/~marso/) collected and rendered from publicly
available free fonts.

Figure 1-8. The mean image (top left) and the first seven modes; that is, the directions with most
variation.


Note that the images need to be converted back from the one-dimensional represen- 
tation using reshape(). Running the example should give eight images in one figure 
window like the ones in Figure 1-8. Here we used the PyLab function subplot() to place 
multiple plots in one window. 

Using the Pickle Module 

If you want to save some results or data for later use, the pickle module, which comes 
with Python, is very useful. Pickle can take almost any Python object and convert it to 
a string representation. This process is called pickling. Reconstructing the object from 
the string representation is conversely called unpickling. This string representation can 
then be easily stored or transmitted. 
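
As a quick sketch of the round trip, pickle also has the functions dumps() and loads(), which work on an in-memory string instead of a file. The dictionary below is made-up example data.

```python
import pickle

data = {'name': 'font_a', 'size': (25, 25), 'values': [0.5, 1.5]}

s = pickle.dumps(data)         # pickling: object -> string representation
restored = pickle.loads(s)     # unpickling: string representation -> object
```

The restored object is a new object that compares equal to the original one.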

Let's illustrate this with an example. Suppose we want to save the image mean and
principal components of the font images in the previous section. This is done like this:

import pickle

# save mean and principal components
f = open('font_pca_modes.pkl', 'wb')
pickle.dump(immean,f)
pickle.dump(V,f)
f.close()

As you can see, several objects can be pickled to the same file. There are several different 
protocols available for the .pkl files, and if unsure, it is best to read and write binary files. 
To load the data in some other Python session, just use the load() method like this: 

# load mean and principal components 
f = open('font_pca_modes.pkl', 'rb') 
immean = pickle.load(f) 

V = pickle.load(f) 
f.close() 

Note that the order of the objects should be the same! There is also an optimized version
written in C called cPickle that is fully compatible with the standard pickle module.
More details can be found on the pickle module documentation page
http://docs.python.org/library/pickle.html.

For the remainder of this book, we will use the with statement to handle file reading
and writing. This is a construct that was introduced in Python 2.5 that automatically
handles opening and closing of files (even if errors occur while the files are open). Here
is what the saving and loading above looks like using with:

# open file and save
with open('font_pca_modes.pkl', 'wb') as f:
  pickle.dump(immean,f)
  pickle.dump(V,f)

and:

# open file and load
with open('font_pca_modes.pkl', 'rb') as f:
  immean = pickle.load(f)
  V = pickle.load(f)

This might look strange the first time you see it, but it is a very useful construct. If you
don't like it, just use the open and close functions as above.

As an alternative to using pickle, NumPy also has simple functions for reading and writing 
text files that can be useful if your data does not contain complicated structures, for 
example a list of points clicked in an image. To save an array x to file, use: 

savetxt('test.txt',x,'%i')

The last parameter indicates that integer format should be used. Similarly, reading is
done like this:

x = loadtxt('test.txt')

You can find out more from the online documentation
http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html.

Finally, NumPy has dedicated functions for saving and loading arrays. Look for save()
and load() in the online documentation for the details.
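
Here is a small sketch comparing the two approaches. It writes to a temporary directory (an assumption made so the example does not leave files behind) and uses a made-up integer array.

```python
import os
import tempfile
import numpy as np

x = np.arange(12).reshape(3, 4)
folder = tempfile.mkdtemp()

# binary format: shape and data type are stored with the data
npy_path = os.path.join(folder, 'test.npy')
np.save(npy_path, x)
y = np.load(npy_path)

# text format: values are written as integers, read back as floats by default
txt_path = os.path.join(folder, 'test.txt')
np.savetxt(txt_path, x, '%i')
z = np.loadtxt(txt_path)
```

Note that loadtxt() returns floating point values unless a data type is given, while save() and load() keep the original type.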

1.4 SciPy 

SciPy (http://scipy.org/) is an open-source package for mathematics that builds on
NumPy and provides efficient routines for a number of operations, including numerical 
integration, optimization, statistics, signal processing, and most importantly for us, 
image processing. As the following will show, there are many useful modules in SciPy. 
SciPy is free and available at http://scipy.org/Download. 

Blurring Images

A classic and very useful example of image convolution is Gaussian blurring of images.
In essence, the (grayscale) image I is convolved with a Gaussian kernel to create a
blurred version

$$I_\sigma = I * G_\sigma,$$

where * indicates convolution and $G_\sigma$ is a Gaussian 2D-kernel with standard deviation
$\sigma$ defined as

$$G_\sigma = \frac{1}{2\pi\sigma^2} e^{-(x^2+y^2)/(2\sigma^2)}.$$

Gaussian blurring is used to define an image scale to work in, for interpolation, for 
computing interest points, and in many more applications. 

SciPy comes with a module for filtering called scipy.ndimage.filters that can be used
to compute these convolutions using a fast 1D separation. All you need to do is this:

from PIL import Image
from numpy import *
from scipy.ndimage import filters

im = array(Image.open('empire.jpg').convert('L'))
im2 = filters.gaussian_filter(im,5)

Here the last parameter of gaussian_filter() is the standard deviation.
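
The 1D separation means that blurring with a 2D Gaussian gives the same result as blurring along the rows and then along the columns with a 1D Gaussian. The sketch below checks this on a random test array; it calls the functions through the top-level scipy.ndimage namespace, and the array size and seed are arbitrary.

```python
import numpy as np
from scipy import ndimage

rng = np.random.RandomState(0)
im = rng.uniform(0, 255, (40, 60))      # random stand-in for a grayscale image

# 2D Gaussian blur in one call
blurred = ndimage.gaussian_filter(im, 5)

# the same blur as two 1D passes, one per axis
tmp = ndimage.gaussian_filter1d(im, 5, axis=0)
separable = ndimage.gaussian_filter1d(tmp, 5, axis=1)

max_diff = np.abs(blurred - separable).max()
```

Two 1D passes cost far fewer operations per pixel than a full 2D kernel, which is why the separation matters for large values of the standard deviation.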

Figure 1-9 shows examples of an image blurred with increasing $\sigma$. Larger values give
less detail. To blur color images, simply apply Gaussian blurring to each color channel:

im = array(Image.open('empire.jpg'))
im2 = zeros(im.shape)
for i in range(3):
  im2[:,:,i] = filters.gaussian_filter(im[:,:,i],5)
im2 = uint8(im2)

Here the last conversion to “uint8” is not always needed but forces the pixel values to
be in 8-bit representation. We could also have used

im2 = array(im2,'uint8')

for the conversion. 



Figure 1-9. An example of Gaussian blurring using the scipy.ndimage.filters module: (a) original
image in grayscale; (b) Gaussian filter with $\sigma = 2$; (c) with $\sigma = 5$; (d) with $\sigma = 10$.

For more information on using this module and the different parameter choices,
check out the SciPy documentation of scipy.ndimage at
http://docs.scipy.org/doc/scipy/reference/ndimage.html.


Image Derivatives 

How the image intensity changes over the image is important information and is used
for many applications, as we will see throughout this book. The intensity change is
described with the x and y derivatives $I_x$ and $I_y$ of the graylevel image I (for color
images, derivatives are usually taken for each color channel).

The image gradient is the vector $\nabla I = [I_x, I_y]^T$. The gradient has two important
properties, the gradient magnitude

$$|\nabla I| = \sqrt{I_x^2 + I_y^2},$$

which describes how strong the image intensity change is, and the gradient angle

$$\alpha = \mathrm{arctan2}(I_y, I_x),$$

which indicates the direction of largest intensity change at each point (pixel) in the
image. The NumPy function arctan2() returns the signed angle in radians, in the interval
$-\pi \ldots \pi$.

Computing the image derivatives can be done using discrete approximations. These are
most easily implemented as convolutions

$$I_x = I * D_x \quad \text{and} \quad I_y = I * D_y,$$

where $D_x$ and $D_y$ are small derivative filters; a common choice is the Sobel filters.

These derivative filters are easy to implement using the standard convolution available
in the scipy.ndimage.filters module. For example:


from PIL import Image
from numpy import *
from scipy.ndimage import filters

im = array(Image.open('empire.jpg').convert('L'))

# Sobel derivative filters
imx = zeros(im.shape)
filters.sobel(im,1,imx)

imy = zeros(im.shape)
filters.sobel(im,0,imy)

magnitude = sqrt(imx**2+imy**2)

This computes x and y derivatives and gradient magnitude using the Sobel filter. The
second argument selects the x or y derivative, and the third stores the output. Figure 1-10
shows an image with derivatives computed using the Sobel filter. In the two derivative
images, positive derivatives are shown with bright pixels and negative derivatives are
dark. Gray areas have values close to zero.
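
To get a feel for the numbers, here is a sketch on a synthetic image with a single vertical edge, so that no image file is needed. The test image is made up, and the functions are called through the top-level scipy.ndimage namespace.

```python
import numpy as np
from scipy import ndimage

im = np.zeros((20, 20))
im[:, 10:] = 1.0                        # intensity changes only along x

imx = ndimage.sobel(im, axis=1)         # x derivative
imy = ndimage.sobel(im, axis=0)         # y derivative

magnitude = np.sqrt(imx**2 + imy**2)    # gradient magnitude
angle = np.arctan2(imy, imx)            # gradient angle in radians
```

Since the intensity is constant along y, imy is zero everywhere and the magnitude is non-zero only in the two columns next to the edge.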

Using this approach has the drawback that derivatives are taken on the scale determined
by the image resolution. To be more robust to image noise and to compute derivatives
at any scale, Gaussian derivative filters can be used:

$$I_x = I * G_{\sigma x} \quad \text{and} \quad I_y = I * G_{\sigma y},$$

where $G_{\sigma x}$ and $G_{\sigma y}$ are the x and y derivatives of $G_\sigma$, a Gaussian function with
standard deviation $\sigma$.

The filters.gaussian_filter() function we used for blurring earlier can also take extra 
arguments to compute Gaussian derivatives instead. To try this on an image, simply do: 

sigma = 5 # standard deviation

imx = zeros(im.shape)
filters.gaussian_filter(im, (sigma,sigma), (0,1), imx)

imy = zeros(im.shape)
filters.gaussian_filter(im, (sigma,sigma), (1,0), imy)

The third argument specifies which order of derivatives to use in each direction using
the standard deviation determined by the second argument. See the documentation
for the details. Figure 1-11 shows the derivatives and gradient magnitude for different
scales. Compare this to the blurring at the same scales in Figure 1-9.

Figure 1-10. An example of computing image derivatives using Sobel derivative filters: (a) original image
in grayscale; (b) x-derivative; (c) y-derivative; (d) gradient magnitude.

Figure 1-11. An example of computing image derivatives using Gaussian derivatives: x-derivative (top),
y-derivative (middle), and gradient magnitude (bottom); (a) original image in grayscale, (b) Gaussian
derivative filter with $\sigma = 2$, (c) with $\sigma = 5$, (d) with $\sigma = 10$.
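
The effect of scale can also be sketched on a synthetic step edge. The image below is made up for illustration, and the derivative order is passed with the order keyword of scipy.ndimage.gaussian_filter(). The larger the standard deviation, the wider and weaker the edge response becomes.

```python
import numpy as np
from scipy import ndimage

im = np.zeros((100, 100))
im[:, 50:] = 1.0                        # unit step edge along x

peaks = []
for sigma in (2, 5, 10):
    # first-order derivative in x (second axis) and in y (first axis)
    imx = ndimage.gaussian_filter(im, (sigma, sigma), order=(0, 1))
    imy = ndimage.gaussian_filter(im, (sigma, sigma), order=(1, 0))
    magnitude = np.sqrt(imx**2 + imy**2)
    peaks.append(magnitude.max())       # strongest response at this scale
```

The list peaks then holds one maximum per scale, in the same order as the values of sigma.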


Morphology—Counting Objects 

Morphology (or mathematical morphology) is a framework and a collection of image
processing methods for measuring and analyzing basic shapes. Morphology is usually
applied to binary images but can be used with grayscale also. A binary image is an
image in which each pixel takes only two values, usually 0 and 1. Binary images are
often the result of thresholding an image, for example with the intention of counting
objects or measuring their size. A good summary of morphology and how it works is
in http://en.wikipedia.org/wiki/Mathematical_morphology.

Morphological operations are included in the scipy.ndimage module morphology.
Counting and measurement functions for binary images are in the scipy.ndimage
module measurements. Let's look at a simple example of how to use them.

Consider the binary image in Figure 1-12a.^ Counting the objects in that image can be
done using:

from scipy.ndimage import measurements,morphology

# load image and threshold to make sure it is binary
im = array(Image.open('houses.png').convert('L'))
im = 1*(im<128)

labels, nbr_objects = measurements.label(im)
print "Number of objects:", nbr_objects

This loads the image and makes sure it is binary by thresholding. Multiplying by 1
converts the boolean array to a binary one. Then the function label() finds the individual
objects and assigns integer labels to pixels according to which object they belong to.
Figure 1-12b shows the labels array. The graylevel values indicate object index. As you
can see, there are small connections between some of the objects. Using an operation
called binary opening, we can remove them:

# morphology - opening to separate objects better 

im_open = morphology.binary_opening(im,ones((9,5)),iterations=2) 

labels_open, nbr_objects_open = measurements.label(im_open) 
print "Number of objects:", nbr_objects_open 

The second argument of binary_opening() specifies the structuring element, an array
that indicates what neighbors to use when centered around a pixel. In this case, we
used 9 pixels (4 above, the pixel itself, and 4 below) in the y direction and 5 in the
x direction. You can specify any array as structuring element; the non-zero elements
will determine the neighbors. The parameter iterations determines how many times to
apply the operation. Try this and see how the number of objects changes. The image
after opening and the corresponding label image are shown in Figure 1-12c-d. As you
might expect, there is a function named binary_closing() that does the reverse. We
leave that and the other functions in morphology and measurements to the exercises. You
can learn more about them from the scipy.ndimage documentation
http://docs.scipy.org/doc/scipy/reference/ndimage.html.
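
If you don't have a suitable binary image at hand, the effect of opening can be sketched on a synthetic one. Below, two made-up rectangular blobs are joined by a bridge one pixel wide; the 3 x 3 structuring element is an arbitrary choice for this small example, and the functions are called through the top-level scipy.ndimage namespace.

```python
import numpy as np
from scipy import ndimage

im = np.zeros((20, 30), dtype=int)
im[5:15, 3:12] = 1                      # first blob
im[5:15, 18:27] = 1                     # second blob
im[9, 12:18] = 1                        # thin bridge connecting the blobs

# counting before opening: the bridge makes this a single object
labels, nbr_objects = ndimage.label(im)

# opening removes structures thinner than the structuring element
im_open = ndimage.binary_opening(im, np.ones((3, 3)))
labels_open, nbr_objects_open = ndimage.label(im_open)
```

Before opening the bridge makes label() count one object; after opening the two blobs are counted separately.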


^ This image is actually the result of image “segmentation.” Take a look at Section 9.3 if you want to see how
this image was created.

Figure 1-12. An example of morphology. Binary opening to separate objects followed by counting
them: (a) original binary image; (b) label image corresponding to the original, grayvalues indicate
object index; (c) binary image after opening; (d) label image corresponding to the opened image.

Useful SciPy Modules

SciPy comes with some useful modules for input and output. Two of them are io 
and misc. 

Reading and writing .mat files 

If you have some data, or find some interesting data set online, stored in Matlab's .mat
file format, it is possible to read this using the scipy.io module. This is how to do it:

import scipy.io
data = scipy.io.loadmat('test.mat')

The object data now contains a dictionary with keys corresponding to the variable 
names saved in the original .mat file. The variables are in array format. Saving to .mat 
files is equally simple. Just create a dictionary with all variables you want to save and
use savemat():

data = {}
data['x'] = x
scipy.io.savemat('test.mat',data)

This saves the array x so that it has the name “x” when read into Matlab. More
information on scipy.io can be found in the online documentation,
http://docs.scipy.org/doc/scipy/reference/io.html.
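
The following sketch does a full round trip, saving an array and reading it back. The temporary directory and the small test array are assumptions made for the example.

```python
import os
import tempfile
import numpy as np
import scipy.io

x = np.arange(6.0).reshape(2, 3)
path = os.path.join(tempfile.mkdtemp(), 'test.mat')

scipy.io.savemat(path, {'x': x})        # the key becomes the Matlab variable name
data = scipy.io.loadmat(path)           # dictionary keyed by variable name
x_loaded = data['x']
```

Besides the saved variables, the dictionary returned by loadmat() also contains a few entries with file metadata, so look up variables by name rather than iterating over all keys.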

Saving arrays as images 

Since we are manipulating images and doing computations using array objects, it is 
useful to be able to save them directly as image files.^ Many images in this book are
created just like this. 

The imsave() function is available through the scipy.misc module. To save an array im
to file just do the following: 

from scipy.misc import imsave 
imsave( 'test.jpg' ,im) 

The scipy.misc module also contains the famous “Lena” test image:

lena = scipy.misc.lena()

This will give you a 512 x 512 grayscale array version of the image. 

1.5 Advanced Example: Image De-Noising 

We conclude this chapter with a very useful example, de-noising of images. Image de- 
noising is the process of removing image noise while at the same time trying to preserve 
details and structures. We will use the Rudin-Osher-Fatemi de-noising model (ROF) 
originally introduced in [28]. Removing noise from images is important for many 
applications, from making your holiday photos look better to improving the quality 
of satellite images. The ROF model has the interesting property that it finds a smoother 
version of the image while preserving edges and structures. 

The underlying mathematics of the ROF model and the solution techniques are quite 
advanced and outside the scope of this book. We’ll give a brief, simplified introduction 
before showing how to implement a ROF solver based on an algorithm by Cham- 
bolle [5]. 

The total variation (TV) of a (grayscale) image I is defined as the sum of the gradient
norm. In a continuous representation, this is

$$J(I) = \int |\nabla I| \, d\mathbf{x}. \qquad (1.1)$$

In a discrete setting, the total variation becomes

$$J(I) = \sum_{\mathbf{x}} |\nabla I|,$$

where the sum is taken over all image coordinates $\mathbf{x} = [x, y]$.

^ All PyLab figures can be saved in a multitude of image formats by clicking the “save” button in the figure
window.

In the Chambolle version of ROF, the goal is to find a de-noised image U that minimizes

$$\min_U \|I - U\|^2 + 2\lambda J(U),$$

where the norm $\|I - U\|$ measures the difference between U and the original image
I. What this means is, in essence, that the model looks for images that are “flat” but
allows “jumps” at edges between regions.
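
To make the total variation term concrete, here is a sketch that evaluates the discrete sum on a clean and a noisy version of a synthetic image. The helper name total_variation, the forward differences with cyclic borders, and the test image are choices made for this illustration.

```python
import numpy as np

def total_variation(im):
    """Sum of gradient norms using forward differences (cyclic at the borders)."""
    gx = np.roll(im, -1, axis=1) - im
    gy = np.roll(im, -1, axis=0) - im
    return np.sqrt(gx**2 + gy**2).sum()

# piecewise constant image: flat regions with a jump at the square's border
clean = np.zeros((50, 50))
clean[10:40, 10:40] = 1.0

rng = np.random.RandomState(0)
noisy = clean + 0.2 * rng.standard_normal(clean.shape)

tv_clean = total_variation(clean)
tv_noisy = total_variation(noisy)
```

The clean image only contributes along the border of the square, while noise contributes at every pixel. This is why penalizing total variation favors flat regions without forbidding sharp edges.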

Following the recipe in the paper, here's the code:

from numpy import *

def denoise(im,U_init,tolerance=0.1,tau=0.125,tv_weight=100):
  """ An implementation of the Rudin-Osher-Fatemi (ROF) denoising model
    using the numerical procedure presented in eq (11) A. Chambolle (2005).

    Input: noisy input image (grayscale), initial guess for U, weight of
    the TV-regularizing term, steplength, tolerance for stop criterion.

    Output: denoised and detextured image, texture residual. """

  m,n = im.shape # size of noisy image

  # initialize
  U = U_init
  Px = im # x-component to the dual field
  Py = im # y-component of the dual field
  error = 1

  while (error > tolerance):
    Uold = U

    # gradient of primal variable
    GradUx = roll(U,-1,axis=1)-U # x-component of U's gradient
    GradUy = roll(U,-1,axis=0)-U # y-component of U's gradient

    # update the dual variable
    PxNew = Px + (tau/tv_weight)*GradUx
    PyNew = Py + (tau/tv_weight)*GradUy
    NormNew = maximum(1,sqrt(PxNew**2+PyNew**2))

    Px = PxNew/NormNew # update of x-component (dual)
    Py = PyNew/NormNew # update of y-component (dual)

    # update the primal variable
    RxPx = roll(Px,1,axis=1) # right x-translation of x-component
    RyPy = roll(Py,1,axis=0) # right y-translation of y-component

    DivP = (Px-RxPx)+(Py-RyPy) # divergence of the dual field.
    U = im + tv_weight*DivP # update of the primal variable

    # update of error
    error = linalg.norm(U-Uold)/sqrt(n*m)

  return U,im-U # denoised image and texture residual

In this example, we used the function roll(), which, as the name suggests, “rolls” the 
values of an array cyclically around an axis. This is very convenient for computing 
neighbor differences, in this case for derivatives. We also used linalg.norm(), which 
measures the difference between two arrays (in this case, the image matrices U and 
Uold). Save the function denoise() in a file rof.py. 
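
If roll() is new to you, this small sketch on a made-up 2 x 3 array shows the cyclic shift and the neighbor difference it gives.

```python
import numpy as np

a = np.array([[1, 2, 4],
              [7, 11, 16]])

# shift every row one step to the left; the first column wraps around to the end
shifted = np.roll(a, -1, axis=1)

# difference between each element and its right neighbor (cyclic at the border)
diff = shifted - a
```

The last column of diff compares the last and first elements of each row, which is the wrap-around effect to keep in mind at image borders.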

Let's start with a synthetic example of a noisy image:

from numpy import *
from numpy import random
from scipy.ndimage import filters
import rof

# create synthetic image with noise
im = zeros((500,500))
im[100:400,100:400] = 128
im[200:300,200:300] = 255
im = im + 30*random.standard_normal((500,500))

U,T = rof.denoise(im,im)
G = filters.gaussian_filter(im,10)

# save the result
from scipy.misc import imsave
imsave('synth_rof.pdf',U)
imsave('synth_gaussian.pdf',G)

The resulting images are shown in Figure 1-13 together with the original. As you can
see, the ROF version preserves the edges nicely.



Figure 1-13. An example of ROF de-noising of a synthetic example: (a) original noisy image; (b) image
after Gaussian blurring ($\sigma = 10$); (c) image after ROF de-noising.

Figure 1-14. An example of ROF de-noising of a grayscale image: (a) original image; (b) image after
Gaussian blurring ($\sigma = 5$); (c) image after ROF de-noising.


Now, let's see what happens with a real image:

from PIL import Image
from pylab import *
import rof

im = array(Image.open('empire.jpg').convert('L'))
U,T = rof.denoise(im,im)

figure()
gray()
imshow(U)
axis('equal')
axis('off')
show()

The result should look something like Figure 1-14c, which also shows a blurred version
of the same image for comparison. As you can see, ROF de-noising preserves edges and
image structures while at the same time blurring out the “noise.”


Exercises 

1. Take an image and apply Gaussian blur like in Figure 1-9. Plot the image contours
for increasing values of $\sigma$. What happens? Can you explain why?

2. Implement an unsharp masking operation (http://en.wikipedia.org/wiki/Unsharp_masking)
by blurring an image and then subtracting the blurred version from the
original. This gives a sharpening effect to the image. Try this on both color and
grayscale images.

3. An alternative image normalization to histogram equalization is a quotient image. A
quotient image is obtained by dividing the image with a blurred version $I/(I * G_\sigma)$.
Implement this and try it on some sample images.

4. Write a function that finds the outline of simple objects in images (for example, a 
square against white background) using image gradients. 

5. Use gradient direction and magnitude to detect lines in an image. Estimate the 
extent of the lines and their parameters. Plot the lines overlaid on the image. 

6. Apply the label() function to a thresholded image of your choice. Use histograms 
and the resulting label image to plot the distribution of object sizes in the image. 

7. Experiment with successive morphological operations on a thresholded image of
your choice. When you have found some settings that produce good results, try the
function center_of_mass in measurements to find the center coordinates of each object
and plot them in the image.

Conventions for the Code Examples 

From Chapter 2 and onward, we assume PIL, NumPy, and Matplotlib are included at the 
top of every file you create and in every code example as: 

from PIL import Image 
from numpy import * 
from pylab import * 

This makes the example code cleaner and the presentation easier to follow. In the cases
when we use SciPy modules, we will explicitly declare that in the examples.

Purists will object to this type of blanket imports and insist on something like 

import numpy as np 

import matplotlib.pyplot as plt 

so that namespaces can be kept (to know where each function comes from) and only 
import the pyplot part of Matplotlib, since the NumPy parts imported with Pylab are not 
needed. Purists and experienced programmers know the difference and can choose 
whichever option they prefer. In the interest of making the content and examples in 
this book easily accessible to readers, I have chosen not to do this. 

Caveat emptor. 

CHAPTER 2


Local Image Descriptors 


This chapter is about finding corresponding points and regions between images. Two 
different types of local descriptors are introduced with methods for matching these 
between images. These local features will be used in many different contexts throughout 
this book and are an important building block in many applications, such as creating 
panoramas, augmented reality and computing 3D reconstructions. 


2.1 Harris Corner Detector


The Harris corner detection algorithm (or sometimes the Harris & Stephens corner 
detector) is one of the simplest corner indicators available. The general idea is to locate 
interest points where the surrounding neighborhood shows edges in more than one 
direction; these are then image corners. 


We define a matrix $M_I = M_I(\mathbf{x})$, on the points x in the image domain, as the positive
semi-definite, symmetric matrix

$$M_I = \nabla I \, \nabla I^T = \begin{bmatrix} I_x \\ I_y \end{bmatrix} \begin{bmatrix} I_x & I_y \end{bmatrix} = \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}, \qquad (2.1)$$

where as before $\nabla I$ is the image gradient containing the derivatives $I_x$ and $I_y$ (we defined
the derivatives and the gradient in Section 1.4). Because of this construction, $M_I$ has rank
one with eigenvalues $\lambda_1 = |\nabla I|^2$ and $\lambda_2 = 0$. We now have one matrix for each pixel in
the image.

Let W be a weight matrix (typically a Gaussian filter $G_\sigma$). The component-wise
convolution

$$\overline{M}_I = W * M_I \qquad (2.2)$$

gives a local averaging of $M_I$ over the neighboring pixels. The resulting matrix $\overline{M}_I$
is sometimes called a Harris matrix. The width of W determines a region of interest
around x. The idea of averaging the matrix $M_I$ over a region like this is that the
eigenvalues will change depending on the local image properties. If the gradients vary
in the region, the second eigenvalue of $\overline{M}_I$ will no longer be zero. If the gradients are
the same, the eigenvalues will be the same as for $M_I$.

Depending on the values of $\nabla I$ in the region, there are three cases for the eigenvalues
of the Harris matrix, $\overline{M}_I$:

• If $\lambda_1$ and $\lambda_2$ are both large positive values, then there is a corner at x.

• If $\lambda_1$ is large and $\lambda_2 \approx 0$, then there is an edge and the averaging of $M_I$ over the
region doesn't change the eigenvalues that much.

• If $\lambda_1 \approx \lambda_2 \approx 0$, then there is nothing.

To distinguish the important case from the others without actually having to compute
the eigenvalues, Harris and Stephens [12] introduced an indicator function

$$\det(\overline{M}_I) - \kappa \, \mathrm{trace}(\overline{M}_I)^2.$$

To get rid of the weighting constant $\kappa$, it is often easier to use the quotient

$$\frac{\det(\overline{M}_I)}{\mathrm{trace}(\overline{M}_I)^2}$$

as an indicator.
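
A small numerical sketch may help here. The two 2 x 2 matrices below are made-up stand-ins for a Harris matrix at a corner and at an edge, the helper name harris_indicators is hypothetical, and the value 0.06 for the weighting constant is just an illustrative assumption.

```python
import numpy as np

def harris_indicators(M, kappa=0.06):
    """Return (det - kappa*trace^2, det/trace^2) for a 2x2 matrix M."""
    det = np.linalg.det(M)
    tr = np.trace(M)
    return det - kappa * tr**2, det / tr**2

corner = np.array([[10.0, 0.0], [0.0, 8.0]])    # two large eigenvalues
edge = np.array([[10.0, 0.0], [0.0, 0.01]])     # one large, one close to zero

corner_r, corner_q = harris_indicators(corner)
edge_r, edge_q = harris_indicators(edge)
```

Both indicators separate the two cases: the first is positive for the corner and negative for the edge, and the quotient is much larger for the corner. Neither needs the eigenvalues themselves, since the determinant is their product and the trace is their sum.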

Let's see what this looks like in code. For this, we need the scipy.ndimage.filters
module for computing derivatives using Gaussian derivative filters as described in
Section 1.4. The reason is again that we would like to suppress noise sensitivity in the
corner detection process.

First, add the corner response function to a file harris.py, which will make use of the
Gaussian derivatives. Again, the parameter $\sigma$ defines the scale of the Gaussian filters
used. You can also modify this function to take different scales in the x and y directions,
as well as a different scale for the averaging, to compute the Harris matrix.

from scipy.ndimage import filters

def compute_harris_response(im,sigma=3):
  """ Compute the Harris corner detector response function
    for each pixel in a graylevel image. """

  # derivatives
  imx = zeros(im.shape)
  filters.gaussian_filter(im, (sigma,sigma), (0,1), imx)
  imy = zeros(im.shape)
  filters.gaussian_filter(im, (sigma,sigma), (1,0), imy)

  # compute components of the Harris matrix
  Wxx = filters.gaussian_filter(imx*imx,sigma)
  Wxy = filters.gaussian_filter(imx*imy,sigma)
  Wyy = filters.gaussian_filter(imy*imy,sigma)

  # determinant and trace
  Wdet = Wxx*Wyy - Wxy**2
  Wtr = Wxx + Wyy

  # corner response as the quotient of determinant and trace
  return Wdet / Wtr

This gives an image with each pixel containing the value of the Harris response function.
Now, it is just a matter of picking out the information needed from this image.
Taking all points with values above a threshold, with the additional constraint that corners
must be separated with a minimum distance, is an approach that often gives good
results. To do this, take all candidate pixels, sort them in descending order of corner
response values, and mark off regions too close to positions already marked as corners.
Add the following function to harris.py:


def get_harris_points(harrisim,min_dist=10,threshold=0.1):
  """ Return corners from a Harris response image
    min_dist is the minimum number of pixels separating
    corners and image boundary. """

  # find top corner candidates above a threshold
  corner_threshold = harrisim.max() * threshold
  harrisim_t = (harrisim > corner_threshold) * 1

  # get coordinates of candidates
  coords = array(harrisim_t.nonzero()).T

  # ... and their values
  candidate_values = [harrisim[c[0],c[1]] for c in coords]

  # sort candidates (reversed to get descending order of response)
  index = argsort(candidate_values)[::-1]

  # store allowed point locations in array
  allowed_locations = zeros(harrisim.shape)
  allowed_locations[min_dist:-min_dist,min_dist:-min_dist] = 1

  # select the best points taking min_distance into account
  filtered_coords = []
  for i in index:
    if allowed_locations[coords[i,0],coords[i,1]] == 1:
      filtered_coords.append(coords[i])
      allowed_locations[(coords[i,0]-min_dist):(coords[i,0]+min_dist),
                        (coords[i,1]-min_dist):(coords[i,1]+min_dist)] = 0

  return filtered_coords


Now you have all you need to detect corner points in images. To show the corner points 
in the image, you can add a plotting function to harris.py using Matplotlib as follows: 


def plot_harris_points(image,filtered_coords):
  """ Plots corners found in image. """

  figure()
  gray()
  imshow(image)
  plot([p[1] for p in filtered_coords],[p[0] for p in filtered_coords],'*')
  axis('off')
  show()

Figure 2-1. An example of corner detection with the Harris corner detector: (a) the Harris response
function; (b-d) corners detected with threshold 0.01, 0.05, and 0.1, respectively.


Try running the following commands: 

import harris

im = array(Image.open('empire.jpg').convert('L'))
harrisim = harris.compute_harris_response(im)
filtered_coords = harris.get_harris_points(harrisim,6)
harris.plot_harris_points(im, filtered_coords)

The image is opened and converted to grayscale. Then, the response function is computed
and points selected based on the response values. Finally, the points are plotted
overlaid on the original image. This should give you a plot like the images in Figure 2-1.

For an overview of different approaches to corner detection, including improvements
on the Harris detector and further developments, see for example
http://en.wikipedia.org/wiki/Corner_detection.

Finding Corresponding Points Between Images 

The Harris corner detector gives interest points in images but does not contain an
inherent way of comparing these interest points across images to find matching
corners. What we need is to add a descriptor to each point and a way to compare such
descriptors.

An interest point descriptor is a vector assigned to an interest point that describes 
the image appearance around the point. The better the descriptor, the better your 
correspondences will be. By point correspondence or corresponding points, we mean 
points in different images that refer to the same object or scene point. 

Harris corner points are usually combined with a descriptor consisting of the graylevel 
values in a neighboring image patch together with normalized cross-correlation for 
comparison. An image patch is almost always a rectangular portion of the image 
centered around the point in question. 

In general, correlation between two (equally sized) image patches $I_1(\mathbf{x})$ and
$I_2(\mathbf{x})$ is defined as

$$c(I_1, I_2) = \sum_{\mathbf{x}} f(I_1(\mathbf{x}), I_2(\mathbf{x})),$$

where the function $f$ varies depending on the correlation method. The sum is taken
over all positions $\mathbf{x}$ in the image patches. For cross-correlation, the function is
$f(I_1, I_2) = I_1 I_2$, and then $c(I_1, I_2) = I_1 \cdot I_2$, with $\cdot$ denoting the scalar
product (of the row- or column-stacked patches). The larger the value of $c(I_1, I_2)$, the
more similar the patches $I_1$ and $I_2$ are.^1


Normalized cross-correlation is a variant of cross-correlation defined as

$$ncc(I_1, I_2) = \frac{1}{n-1} \sum_{\mathbf{x}} \frac{(I_1(\mathbf{x}) - \mu_1)}{\sigma_1} \cdot \frac{(I_2(\mathbf{x}) - \mu_2)}{\sigma_2}, \qquad (2.3)$$

where $n$ is the number of pixels in a patch, $\mu_1$ and $\mu_2$ are the mean intensities, and
$\sigma_1$ and $\sigma_2$ are the standard deviations in each patch, respectively. By subtracting the
mean and scaling with the standard deviation, the method becomes robust to changes
in image brightness.
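
To see the brightness robustness of Equation (2.3) in numbers, here is a small sketch in Python 3 with NumPy. The helper name ncc and the random test patches are assumptions for illustration. The sketch uses the sample standard deviation (ddof=1) so that identical patches score exactly 1 with the 1/(n-1) factor; NumPy's default std() divides by n instead, which changes the scores by a constant factor only.

```python
import numpy as np

def ncc(patch1, patch2):
    """Normalized cross-correlation of two equally sized patches (Equation 2.3)."""
    p1 = np.asarray(patch1, dtype=float).ravel()
    p2 = np.asarray(patch2, dtype=float).ravel()
    n = p1.size
    # ddof=1 gives the sample standard deviation matching the 1/(n-1) factor
    z1 = (p1 - p1.mean()) / p1.std(ddof=1)
    z2 = (p2 - p2.mean()) / p2.std(ddof=1)
    return np.sum(z1 * z2) / (n - 1)

rng = np.random.default_rng(0)
patch = rng.uniform(0, 255, size=(11, 11))
brighter = 1.5 * patch + 40                  # gain and offset change
unrelated = rng.uniform(0, 255, size=(11, 11))

score_same = ncc(patch, brighter)            # unaffected by the brightness change
score_other = ncc(patch, unrelated)          # close to zero for unrelated content
print(score_same, score_other)
```

A change of gain and offset leaves the score of a patch with itself at its maximum, while plain cross-correlation would change with both.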


To extract image patches and compare them using normalized cross-correlation, you 
need two more functions in harris.py. Add these: 


def get_descriptors(image,filtered_coords,wid=5):
    """ For each point return pixel values around the point
        using a neighbourhood of width 2*wid+1. (Assume points are
        extracted with min_distance > wid). """

    desc = []
    for coords in filtered_coords:
        patch = image[coords[0]-wid:coords[0]+wid+1,
                      coords[1]-wid:coords[1]+wid+1].flatten()
        desc.append(patch)

    return desc


def match(desc1,desc2,threshold=0.5):
    """ For each corner point descriptor in the first image,
        select its match to second image using
        normalized cross-correlation. """

    n = len(desc1[0])

    # pair-wise distances
    d = -ones((len(desc1),len(desc2)))
    for i in range(len(desc1)):
        for j in range(len(desc2)):
            d1 = (desc1[i] - mean(desc1[i])) / std(desc1[i])
            d2 = (desc2[j] - mean(desc2[j])) / std(desc2[j])
            ncc_value = sum(d1 * d2) / (n-1)
            if ncc_value > threshold:
                d[i,j] = ncc_value

    ndx = argsort(-d)
    matchscores = ndx[:,0]

    return matchscores


^1 Another popular function is $f(I_1, I_2) = (I_1 - I_2)^2$, which gives sum of squared differences (SSD).
The first function takes a square grayscale patch of odd side length centered around the
point, flattens it, and adds it to a list of descriptors. The second function matches each
descriptor to its best candidate in the other image using normalized cross-correlation.
Note that the distances are negated before sorting, since a high value means a better
match. To further stabilize the matches, we can match from the second image to the
first and filter out the matches that are not the best both ways. The following function
does just that:


def match_twosided(desc1,desc2,threshold=0.5):
    """ Two-sided symmetric version of match(). """

    matches_12 = match(desc1,desc2,threshold)
    matches_21 = match(desc2,desc1,threshold)

    ndx_12 = where(matches_12 >= 0)[0]

    # remove matches that are not symmetric
    for n in ndx_12:
        if matches_21[matches_12[n]] != n:
            matches_12[n] = -1

    return matches_12
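
The effect of the symmetry check is easiest to see on a small score table. The sketch below (Python 3 with NumPy) works directly on a hypothetical matrix of similarity scores instead of descriptors and leaves out the threshold; the names best_match and mutual_matches are assumptions for this illustration.

```python
import numpy as np

def best_match(scores):
    """Index of the highest-scoring column for every row."""
    return np.argmax(scores, axis=1)

def mutual_matches(scores):
    """Keep i -> j only if the best match of j is i again; others become -1."""
    m12 = best_match(scores)        # image 1 -> image 2
    m21 = best_match(scores.T)      # image 2 -> image 1
    out = m12.copy()
    for i, j in enumerate(m12):
        if m21[j] != i:
            out[i] = -1
    return out

# rows: descriptors in image 1, columns: descriptors in image 2
scores = np.array([[0.9, 0.2, 0.1],
                   [0.8, 0.3, 0.2],    # also prefers column 0, but loses to row 0
                   [0.1, 0.2, 0.7]])
mutual = mutual_matches(scores)
print(mutual)
```

Row 1 is dropped because column 0 prefers row 0, which is the situation the two-sided check is meant to catch.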


The matches can be visualized by showing the images side-by-side and connecting
matched points with lines using the following code. Add these two functions to
harris.py.


def appendimages(im1,im2):
    """ Return a new image that appends the two images side-by-side. """

    # select the image with the fewest rows and fill in enough empty rows
    rows1 = im1.shape[0]
    rows2 = im2.shape[0]

    if rows1 < rows2:
        im1 = concatenate((im1,zeros((rows2-rows1,im1.shape[1]))),axis=0)
    elif rows1 > rows2:
        im2 = concatenate((im2,zeros((rows1-rows2,im2.shape[1]))),axis=0)
    # if none of these cases they are equal, no filling needed.

    return concatenate((im1,im2), axis=1)


def plot_matches(im1,im2,locs1,locs2,matchscores,show_below=True):
    """ Show a figure with lines joining the accepted matches
        input: im1,im2 (images as arrays), locs1,locs2 (feature locations),
        matchscores (as output from 'match()'),
        show_below (if images should be shown below matches). """

    im3 = appendimages(im1,im2)
    if show_below:
        im3 = vstack((im3,im3))

    imshow(im3)

    cols1 = im1.shape[1]
    for i,m in enumerate(matchscores):
        if m>0:
            plot([locs1[i][1],locs2[m][1]+cols1],[locs1[i][0],locs2[m][0]],'c')
    axis('off')

Figure 2-2. Example of matches resulting from applying normalized cross-correlation to patches
around Harris corner points.

Figure 2-2 shows an example of finding such corresponding points using normalized 
cross-correlation (in this case, with 11 x 11 pixels in a patch) using the following 
commands: 
wid = 5
harrisim = harris.compute_harris_response(im1,5)
filtered_coords1 = harris.get_harris_points(harrisim,wid+1)
d1 = harris.get_descriptors(im1,filtered_coords1,wid)

harrisim = harris.compute_harris_response(im2,5)
filtered_coords2 = harris.get_harris_points(harrisim,wid+1)
d2 = harris.get_descriptors(im2,filtered_coords2,wid)

print('starting matching')
matches = harris.match_twosided(d1,d2)

figure()
gray()
harris.plot_matches(im1,im2,filtered_coords1,filtered_coords2,matches)
show()

If you only want to plot a subset of the matches to make the visualization clearer, 
substitute matches with, for example, matches[:100] or a random set of indices. 
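
A slice such as matches[:100] works because entry i of the array still refers to point i in the first image. A random subset has to keep that alignment, so one option is to blank out the entries you do not want instead of removing them. The following sketch (Python 3 with NumPy) assumes that -1 marks "no match", as in match_twosided() above; the helper name keep_random_subset is hypothetical.

```python
import numpy as np

def keep_random_subset(matches, count, seed=0):
    """Return a copy of matches where all but `count` valid entries are set to -1."""
    matches = np.asarray(matches).copy()
    valid = np.flatnonzero(matches >= 0)          # positions that hold a match
    rng = np.random.default_rng(seed)
    keep = rng.choice(valid, size=min(count, valid.size), replace=False)
    subset = -np.ones_like(matches)               # start with "no match" everywhere
    subset[keep] = matches[keep]                  # positions stay aligned with locs1
    return subset

matches = np.array([3, -1, 7, 2, -1, 5, 1, 4])
subset = keep_random_subset(matches, 3)
print(subset)
```

The result has the same length as the input, so it can be passed to plot_matches() in place of the full array.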

As you can see in Figure 2-2, there are quite a lot of incorrect matches. This is because
cross-correlation on image patches is not as descriptive as more modern approaches.
As a consequence, it is important to use robust methods for handling these
correspondences in an application. Another problem is that these descriptors are not
invariant to scale or rotation, and the choice of patch sizes affects the results.

In recent years, there has been a lot of development in improving feature point detection
and description. Let's take a look at one of the best algorithms in the next section.

2.2 SIFT—Scale-Invariant Feature Transform 

One of the most successful local image descriptors in the last decade is the
Scale-Invariant Feature Transform (SIFT), introduced by David Lowe in [17]. SIFT was later
refined and described in detail in the paper [18] and has stood the test of time. SIFT
includes both an interest point detector and a descriptor. The descriptor is very robust
and is largely the reason behind the success and popularity of SIFT. Since its
introduction, many alternatives have been proposed with essentially the same type of
descriptor. The descriptor is nowadays often combined with many different interest point
detectors (and region detectors for that matter) and sometimes even applied densely across
the whole image. SIFT features are invariant to scale, rotation, and intensity and can be
matched reliably across 3D viewpoint and noise. A brief overview is available online
at http://en.wikipedia.org/wiki/Scale-invariant_feature_transform.

Interest Points 

SIFT interest point locations are found using difference-of-Gaussian functions

$$D(\mathbf{x}, \sigma) = [G_{k\sigma}(\mathbf{x}) - G_{\sigma}(\mathbf{x})] * I(\mathbf{x}) = [G_{k\sigma} - G_{\sigma}] * I = I_{k\sigma} - I_{\sigma},$$

where $G_{\sigma}$ is the Gaussian 2D kernel described on page 16, $I_{\sigma}$ the $G_{\sigma}$-blurred grayscale
image, and $k$ a constant factor determining the separation in scale. Interest points are the
maxima and minima of $D(\mathbf{x}, \sigma)$ across both image location and scale. These candidate
locations are filtered to remove unstable points. Points are dismissed based on a number
of criteria, like low contrast and points on edges. The details are in the paper.
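
The difference-of-Gaussian function itself is only a few lines of code. The following is a rough sketch in Python 3 using NumPy only; the helper names, the kernel truncation at three standard deviations, the edge padding, and the default k = sqrt(2) are all assumptions made for this illustration. It computes a single D(x, sigma) layer and does none of the scale-space search or filtering of unstable points.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1D Gaussian kernel truncated at 3 sigma and normalized to sum to 1."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

def gaussian_blur(image, sigma):
    """Separable Gaussian blur with edge padding."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    padded = np.pad(np.asarray(image, dtype=float), r, mode='edge')
    # filter along the rows first, then along the columns
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode='valid'), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode='valid'), 0, rows)

def difference_of_gaussians(image, sigma, k=2 ** 0.5):
    """D(x, sigma) = I_{k sigma} - I_sigma for one scale."""
    return gaussian_blur(image, k * sigma) - gaussian_blur(image, sigma)

# a single bright pixel gives a D layer with its extremum at that pixel
image = np.zeros((41, 41))
image[20, 20] = 1.0
dog = difference_of_gaussians(image, sigma=2.0)
extremum = np.unravel_index(np.argmin(dog), dog.shape)
print(extremum)
```

For a bright blob the extremum is a minimum, because the stronger blur lowers the peak more; a dark blob on a bright background would give a maximum instead.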

Descriptor 

The interest point (keypoint) locator above gives position and scale. To achieve
invariance to rotation, a reference direction is chosen based on the direction and magnitude
of the image gradient around each point. The dominant direction is used as reference
and determined using an orientation histogram (weighted with the magnitude).

The next step is to compute a descriptor based on the position, scale, and rotation.
To obtain robustness against image intensity, the SIFT descriptor uses image gradients
(compare that to normalized cross-correlation above, which uses the image intensities).
The descriptor takes a grid of subregions around the point and for each subregion
computes an image gradient orientation histogram. The histograms are concatenated
to form a descriptor vector. The standard setting uses 4 x 4 subregions with 8-bin
orientation histograms, resulting in a 128-bin histogram (4 * 4 * 8 = 128). Figure 2-3
illustrates the construction of the descriptor. The interested reader should look at [18]
for the details or http://en.wikipedia.org/wiki/Scale-invariant_feature_transform for an
overview.
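
To make the 4 * 4 * 8 = 128 layout concrete, here is a deliberately simplified sketch in Python 3 with NumPy. It only shows the grid of magnitude-weighted orientation histograms; it does not rotate the patch to the reference direction and skips the weighting and interpolation steps described in [18], so it is not a SIFT implementation. The function name and the 16 x 16 test patch are assumptions.

```python
import numpy as np

def grid_orientation_descriptor(patch, grid=4, bins=8):
    """Concatenate magnitude-weighted gradient-orientation histograms
    over a grid x grid layout of subregions of a square patch."""
    patch = np.asarray(patch, dtype=float)
    gy, gx = np.gradient(patch)                      # derivatives along rows, columns
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx) % (2 * np.pi)   # angle in [0, 2*pi)

    step = patch.shape[0] // grid
    histograms = []
    for r in range(grid):
        for c in range(grid):
            rows = slice(r * step, (r + 1) * step)
            cols = slice(c * step, (c + 1) * step)
            h, _ = np.histogram(orientation[rows, cols], bins=bins,
                                range=(0, 2 * np.pi),
                                weights=magnitude[rows, cols])
            histograms.append(h)

    desc = np.concatenate(histograms)
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc

# an intensity ramp has the same gradient direction everywhere
patch = np.add.outer(np.arange(16), 2 * np.arange(16))
descriptor = grid_orientation_descriptor(patch)
print(descriptor.shape)
```

For the ramp, each of the 16 subregions puts all of its weight into a single orientation bin, which is why only 16 of the 128 entries are nonzero.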

Detecting Interest Points 

To compute SIFT features for images, we will use the binaries available with the open
source package VLFeat [36]. A full Python implementation of all the steps in the
algorithm would not be very efficient and really is outside the scope of this book. VLFeat
is available at http://www.vlfeat.org/, with binaries for all major platforms. The library
is written in C but has a command line interface that we can use. There is also a Matlab
interface and a Python wrapper (http://github.com/mmmikael/vlfeat/) if you prefer that
to the binaries used here. The Python wrapper can be a little tricky to install on some
platforms due to its dependencies, so we will focus on the binaries instead. There is
also an alternative SIFT implementation available at Lowe's website,
http://www.cs.ubc.ca/~lowe/keypoints/ (Windows and Linux only).

Figure 2-3. An illustration of the construction of the feature vector for the SIFT descriptor: (a) a frame
around an interest point, oriented according to the dominant gradient direction; (b) an 8-bin histogram
over the direction of the gradient in a part of the grid; (c) histograms are extracted in each grid location;
(d) the histograms are concatenated to form one long feature vector.

Create a file sift.py and add the following function that calls the executable: 

# imports used by the functions in sift.py
from PIL import Image
from numpy import *
from pylab import *
import os

def process_image(imagename,resultname,params="--edge-thresh 10 --peak-thresh 5"):
    """ Process an image and save the results in a file. """

    if imagename[-3:] != 'pgm':
        # create a pgm file
        im = Image.open(imagename).convert('L')
        im.save('tmp.pgm')
        imagename = 'tmp.pgm'

    cmmd = str("sift "+imagename+" --output="+resultname+
               " "+params)
    os.system(cmmd)
    print('processed %s to %s' % (imagename, resultname))

The binaries need the image in grayscale .pgm format, so if another image format is
used, we first convert to a temporary .pgm file. The result is stored in a text file in an
easy-to-read format. The files look something like this:

318.861 7.48227 1.12001 1.68523 0 0 0 1 0 0 0 0 0 11 16 0 ...
318.861 7.48227 1.12001 2.99965 11 2 0 0 1 0 0 0 173 67 0 0 ...
54.2821 14.8586 0.895827 4.29821 60 46 0 0 0 0 0 0 99 42 0 0 ...
155.714 23.0575 1.10741 1.54095 6 0 0 0 150 11 0 0 150 18 2 1 ...
42.9729 24.2012 0.969313 4.68892 90 29 0 0 0 1 2 10 79 45 5 11 ...
229.037 23.7603 0.921754 1.48754 3 0 0 0 141 31 0 0 141 45 0 0 ...
232.362 24.0091 1.0578 1.65089 11 1 0 16 134 0 0 0 106 21 16 33 ...
201.256 25.5857 1.04879 2.01664 10 4 1 8 14 2 1 9 88 13 0 0 ...


Here, each row contains the coordinates, scale, and rotation angle for each interest point 
as the first four values, followed by the 128 values of the corresponding descriptor. The 
descriptor is represented with the raw integer values and is not normalized. This is 
something you will want to do when comparing descriptors. More on that later. 

The example above shows the first part of the first eight features found in an image. 
Note that the two first rows have the same coordinates but different rotation. This can 
happen if several strong directions are found at the same interest point. 
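
As a small illustration of this layout, the sketch below (Python 3 with NumPy) parses three of the rows above from a string, splits off the four location values, and scales each descriptor to unit length. The rows are cut to 8 descriptor values purely to keep the example short; real files carry 128 values per row.

```python
import io
import numpy as np

# three rows from the listing above, truncated to 8 descriptor values each
sample = """54.2821 14.8586 0.895827 4.29821 60 46 0 0 0 0 0 0
155.714 23.0575 1.10741 1.54095 6 0 0 0 150 11 0 0
42.9729 24.2012 0.969313 4.68892 90 29 0 0 0 1 2 10
"""

features = np.loadtxt(io.StringIO(sample))
locs, desc = features[:, :4], features[:, 4:]     # locations, raw descriptors

# scale every descriptor (row) to unit length before comparing them
norms = np.linalg.norm(desc, axis=1, keepdims=True)
desc_unit = desc / norms
print(locs.shape, desc_unit.shape)
```

After the normalization, scalar products between rows lie between -1 and 1, which is what the matching code later in this section relies on.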

Here's how to read the features to NumPy arrays from an output file like the one above.
Add this function to sift.py.

def read_features_from_file(filename):
    """ Read feature properties and return in matrix form. """

    f = loadtxt(filename)
    return f[:,:4],f[:,4:] # feature locations, descriptors

Here we used the NumPy function loadtxt() to do all the work for us.
If you modify the descriptors in your Python session, writing the result back to feature
files can be useful. The function below does this for you using NumPy's savetxt():

def write_features_to_file(filename,locs,desc):
    """ Save feature location and descriptor to file. """
    savetxt(filename,hstack((locs,desc)))

This uses the function hstack() that horizontally stacks the two arrays by concatenating
the rows so that the descriptor part comes after the locations on each row.
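
A quick way to convince yourself that writing and reading are inverse operations is a round trip through a temporary file. This sketch (Python 3 with NumPy) uses made-up locations and four-value descriptors; the file name demo.sift is an arbitrary choice.

```python
import os
import tempfile
import numpy as np

# hypothetical data: two features with 4 location values and 4 descriptor values
locs = np.array([[10.5, 20.25, 1.5, 0.3],
                 [30.0, 40.75, 2.0, 1.2]])
desc = np.array([[0.0, 3.0, 12.0, 7.0],
                 [5.0, 0.0, 1.0, 9.0]])

stacked = np.hstack((locs, desc))       # one row per feature, locations first

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.sift')
    np.savetxt(path, stacked)
    restored = np.loadtxt(path)

locs_back, desc_back = restored[:, :4], restored[:, 4:]
print(stacked.shape)
```

Splitting at column 4 recovers the two parts because hstack() keeps the column order of its arguments.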

Having read the features, visualizing them by plotting their locations in the image is a 
simple task. Just add plot_features() as below to your file sift.py. 

def plot_features(im,locs,circle=False):
    """ Show image with features. input: im (image as array),
        locs (row, col, scale, orientation of each feature). """

    def draw_circle(c,r):
        t = arange(0,1.01,.01)*2*pi
        x = r*cos(t) + c[0]
        y = r*sin(t) + c[1]
        plot(x,y,'b',linewidth=2)

    imshow(im)
    if circle:
        for p in locs:
            draw_circle(p[:2],p[2])
    else:
        plot(locs[:,0],locs[:,1],'ob')
    axis('off')

This will plot the location of the SIFT points as blue dots overlaid on the image. If the 
optional parameter circle is set to “True”, circles with radius equal to the scale of the 
feature will be drawn instead using the helper function draw_circle(). 

The following commands will create a plot like the one in Figure 2-4b with the SIFT
feature locations shown:

import sift

imname = 'empire.jpg'
im1 = array(Image.open(imname).convert('L'))
sift.process_image(imname,'empire.sift')
l1,d1 = sift.read_features_from_file('empire.sift')

figure()
gray()
sift.plot_features(im1,l1,circle=True)
show()

To see the difference compared to Harris corners, the Harris corners for the same image
are shown to the right (Figure 2-4c). As you can see, the two algorithms select different
locations.

Figure 2-4. An example of extracting SIFT features for an image: (a) SIFT features; (b) SIFT features
shown with circle indicating the scale of the feature; (c) Harris points for the same image for comparison.


Matching Descriptors 

A robust criterion (also introduced by Lowe) for matching a feature in one image to a
feature in another image is to use the ratio of the distance to the two closest matching
features. This ensures that only features that are distinct enough compared to the
other features in the image are used. As a consequence, the number of false matches is
lowered. Here's what this matching function looks like in code. Add match() to sift.py:

def match(desc1,desc2):
    """ For each descriptor in the first image,
        select its match in the second image.
        input: desc1 (descriptors for the first image),
        desc2 (same for second image). """

    desc1 = array([d/linalg.norm(d) for d in desc1])
    desc2 = array([d/linalg.norm(d) for d in desc2])

    dist_ratio = 0.6
    desc1_size = desc1.shape

    matchscores = zeros((desc1_size[0],1),'int')
    desc2t = desc2.T # precompute matrix transpose
    for i in range(desc1_size[0]):
        dotprods = dot(desc1[i,:],desc2t) # vector of dot products
        dotprods = 0.9999*dotprods

        # inverse cosine and sort, return index for features in second image
        indx = argsort(arccos(dotprods))

        # check if nearest neighbor has angle less than dist_ratio times 2nd
        if arccos(dotprods)[indx[0]] < dist_ratio * arccos(dotprods)[indx[1]]:
            matchscores[i] = int(indx[0])

    return matchscores
This function uses the angle between descriptor vectors as distance measure. This
makes sense only after we have normalized the vectors to unit length.^2 Since the
matching is one-sided, meaning that we are matching each feature to all features in the
other image, we can pre-compute the transpose of the matrix containing the descriptor
vectors of the points in the second image, so that we don’t have to repeat this
exact same operation for each feature.
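
The ratio criterion can be tried on a handful of hand-made vectors. The sketch below (Python 3 with NumPy) is a vectorized variant written for illustration, not the function from sift.py: it marks rejected features with -1 instead of 0, and it clips the scalar products to [-1, 1] rather than scaling them by 0.9999. The name ratio_test_match and the three-dimensional "descriptors" are assumptions.

```python
import numpy as np

def ratio_test_match(desc1, desc2, dist_ratio=0.6):
    """Return, for each row of desc1, the index of its match in desc2 or -1."""
    d1 = desc1 / np.linalg.norm(desc1, axis=1, keepdims=True)
    d2 = desc2 / np.linalg.norm(desc2, axis=1, keepdims=True)
    angles = np.arccos(np.clip(np.dot(d1, d2.T), -1.0, 1.0))   # pairwise angles
    order = np.argsort(angles, axis=1)
    matches = -np.ones(len(d1), dtype=int)
    for i in range(len(d1)):
        nearest, second = order[i, 0], order[i, 1]
        # accept only if the best angle is clearly smaller than the runner-up
        if angles[i, nearest] < dist_ratio * angles[i, second]:
            matches[i] = nearest
    return matches

desc2 = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
desc1 = np.array([[0.9, 0.1, 0.0],     # clearly closest to row 0 of desc2
                  [1.0, 1.0, 0.0]])    # equally close to rows 0 and 1: ambiguous
ratio_matches = ratio_test_match(desc1, desc2)
print(ratio_matches)
```

The second vector is rejected even though it has a nearest neighbour, because that neighbour is no better than the runner-up.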

To further increase the robustness of the matches, we can reverse the procedure and 
match the other way (from the features in the second image to features in the first) and 
only keep the correspondences that satisfy the matching criteria both ways (same as we 
did for the Harris points). The function match_twosided() does just this: 


def match_twosided(desc1,desc2):
    """ Two-sided symmetric version of match(). """

    matches_12 = match(desc1,desc2)
    matches_21 = match(desc2,desc1)

    ndx_12 = matches_12.nonzero()[0]

    # remove matches that are not symmetric
    for n in ndx_12:
        if matches_21[int(matches_12[n])] != n:
            matches_12[n] = 0

    return matches_12

To plot the matches, we can use the same functions used in harris.py. Just copy the 
functions appendimages() and plot_matches() and add them to sift.py for convenience. 
You could also import harris.py and use them from there if you like. 

Figures 2-5 and 2-6 show some examples of SIFT feature points detected in image pairs 
together with pair-wise matches returned from the function match_twosided(). 

Figure 2-7 shows another example of matching features found in two images using
match() and match_twosided(). As you can see, using the symmetric (two-sided) matching
condition removes the incorrect matches and keeps the good ones (some correct
matches are also removed).

With detection and matching of feature points, we have everything needed to apply 
these local descriptors to a number of applications. The coming two chapters will add 
geometric constraints on correspondences in order to robustly filter out the incorrect 
ones and apply local descriptors to examples such as automatic panorama creation, 
camera pose estimation, and 3D structure computation.


^2 In the case of unit length vectors, ranking by the scalar product (without the arccos()) is equivalent
to ranking by the standard Euclidean distance, since $\|a - b\|^2 = 2 - 2\, a \cdot b$.
Figure 2-7. An example of matching SIFT features between two images: (a) matches from features in
the left image without using the two-sided match function; (b) the remaining matches after using the 
two-sided version. 


2.3 Matching Geotagged Images 

Let's end this chapter by looking at an example application of using local descriptors
for matching images with geotags.


Downloading Geotagged Images from Panoramio 

One source of geotagged images is the photo-sharing service Panoramio
(http://www.panoramio.com/), owned by Google. Like many web services, Panoramio has an API
to access content programmatically. Their API is simple and straightforward and is
described at http://www.panoramio.com/api/. By making an HTTP GET call to a URL
like this:

http://www.panoramio.com/map/get_panoramas.php?order=popularity&set=public&
from=0&to=20&minx=-180&miny=-90&maxx=180&maxy=90&size=medium

where minx, miny, maxx, maxy define the geographic area from which to select photos
(minimum longitude, latitude, maximum longitude and latitude, respectively), you
will get the response in easy-to-parse JSON format. JSON is a common format for data
transfer between web services and is more lightweight than XML and other alternatives.
You can read more about JSON at http://en.wikipedia.org/wiki/JSON.

An interesting location with two distinct views is the White House in Washington, D.C.,
which is usually photographed from Pennsylvania Avenue from the south side or from
the north. The coordinates (latitude, longitude) are:

lt=38.897661
ln=-77.036564

To convert to the format needed for the API call, subtract and add a number from these
coordinates to get all images within a square centered around the White House. The call:

http://www.panoramio.com/map/get_panoramas.php?order=popularity&set=public&
from=0&to=20&minx=-77.037564&miny=38.896662&maxx=-77.035564&maxy=38.898662&
size=medium

returns the first 20 images within the coordinate bounds (±0.001), ordered according
to popularity. The response looks something like this:

{ "count": 349,
"photos": [{"photo_id": 7715073, "photo_title": "White House", "photo_url":
"http://www.panoramio.com/photo/7715073", "photo_file_url":
"http://mw2.google.com/mw-panoramio/photos/medium/7715073.jpg", "longitude":
-77.036583, "latitude": 38.897488, "width": 500, "height": 375, "upload_date":
"10 February 2008", "owner_id": 1213603, "owner_name": "***", "owner_url":
"http://www.panoramio.com/user/1213603"}
,
{"photo_id": 1303971, "photo_title": "White House balcony", "photo_url":
"http://www.panoramio.com/photo/1303971", "photo_file_url":
"http://mw2.google.com/mw-panoramio/photos/medium/1303971.jpg", "longitude":
-77.036353, "latitude": 38.897471, "width": 500, "height": 336, "upload_date":
"13 March 2007", "owner_id": 195000, "owner_name": "***", "owner_url":
"http://www.panoramio.com/user/195000"}
]}
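
The bounding-box query shown before the response can also be assembled from a latitude and longitude in code. The following sketch uses Python 3's urllib.parse; the function name panoramio_query and its parameters are assumptions. It only builds the string and sends no request (the Panoramio service has since been shut down), and because the bounds are computed as exactly ±0.001, they may differ in the last digit from the URL printed above.

```python
from urllib.parse import urlencode

def panoramio_query(lat, lon, half_size=0.001, first=0, last=20):
    """Build the get_panoramas.php query for a square around (lat, lon)."""
    params = [
        ('order', 'popularity'),
        ('set', 'public'),
        ('from', first),
        ('to', last),
        ('minx', round(lon - half_size, 6)),   # minimum longitude
        ('miny', round(lat - half_size, 6)),   # minimum latitude
        ('maxx', round(lon + half_size, 6)),   # maximum longitude
        ('maxy', round(lat + half_size, 6)),   # maximum latitude
        ('size', 'medium'),
    ]
    return 'http://www.panoramio.com/map/get_panoramas.php?' + urlencode(params)

query_url = panoramio_query(38.897661, -77.036564)
print(query_url)
```

Note that x corresponds to longitude and y to latitude, which is the reverse of the (latitude, longitude) order in which the coordinates were given.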

To parse this JSON response, we can use the simplejson package, which is available
at http://github.com/simplejson/simplejson. There is online documentation available on
the project page.

If you are running Python 2.6 or later, there is no need to use simplejson as there is a 
JSON library included with these later versions of Python. To use the built-in one, just 
import like this: 

import json 

If you want to use simplejson where available (it is faster and could contain newer 
features than the built-in one), a good idea is to import with a fallback, like this: 

try: import simplejson as json 
except ImportError: import json 
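
Parsing can be tried offline on a trimmed copy of the response shown earlier. This sketch uses the built-in json module under Python 3; the string below keeps only a few of the fields from the two photos in the sample response.

```python
import json

# trimmed copy of the sample response, with only some of the fields
response_text = '''
{ "count": 349,
  "photos": [
    {"photo_id": 7715073, "photo_title": "White House",
     "photo_file_url": "http://mw2.google.com/mw-panoramio/photos/medium/7715073.jpg",
     "longitude": -77.036583, "latitude": 38.897488},
    {"photo_id": 1303971, "photo_title": "White House balcony",
     "photo_file_url": "http://mw2.google.com/mw-panoramio/photos/medium/1303971.jpg",
     "longitude": -77.036353, "latitude": 38.897471}
  ]}
'''

data = json.loads(response_text)                      # dictionaries and lists
imurls = [photo['photo_file_url'] for photo in data['photos']]
filenames = [u.rsplit('/', 1)[-1] for u in imurls]    # last path component
print(filenames)
```

The parsed result is an ordinary dictionary, so the list of photos and their fields can be accessed with normal indexing.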
The following code will use the urllib package that comes with Python to handle the
requests and then parse the result using simplejson:

import os
import urllib, urlparse # Python 2 modules
import simplejson as json

# query for images
url = 'http://www.panoramio.com/map/get_panoramas.php?order=popularity&\
set=public&from=0&to=20&minx=-77.037564&miny=38.896662&\
maxx=-77.035564&maxy=38.898662&size=medium'
c = urllib.urlopen(url)

# get the urls of individual images from JSON
j = json.loads(c.read())
imurls = []
for im in j['photos']:
    imurls.append(im['photo_file_url'])

# download images
for url in imurls:
    image = urllib.URLopener()
    image.retrieve(url, os.path.basename(urlparse.urlparse(url).path))
    print('downloading: ' + url)

As you can easily see by looking at the JSON output, it is the “photo_file_url” field we
are after. Running the code above, you should see something like this in your console:

downloading: http://mw2.google.com/mw-panoramio/photos/medium/7715073.jpg
downloading: http://mw2.google.com/mw-panoramio/photos/medium/1303971.jpg
downloading: http://mw2.google.com/mw-panoramio/photos/medium/270077.jpg
downloading: http://mw2.google.com/mw-panoramio/photos/medium/15502.jpg


Figure 2-8 shows the 20 images returned for this example. Now we just need to find 
and match features between pairs of images. 

Matching Using Local Descriptors 

Having downloaded the images, we now need to extract local descriptors. In this case, 
we will use SIFT descriptors as described in the previous section. Let's assume that the
images have been processed with the SIFT extraction code and the features are stored 
in files with the same name as the images (but with file ending “.sift” instead of “.jpg”). 
The lists imlist and featlist are assumed to contain the filenames. We can do a pairwise 
matching between all combinations as follows: 

import sift

nbr_images = len(imlist)

matchscores = zeros((nbr_images,nbr_images))
for i in range(nbr_images):
    for j in range(i,nbr_images): # only compute upper triangle
        print('comparing %s %s' % (imlist[i], imlist[j]))

        l1,d1 = sift.read_features_from_file(featlist[i])
        l2,d2 = sift.read_features_from_file(featlist[j])

        matches = sift.match_twosided(d1,d2)

        nbr_matches = sum(matches > 0)
        print('number of matches = %d' % nbr_matches)
        matchscores[i,j] = nbr_matches

# copy values
for i in range(nbr_images):
    for j in range(i+1,nbr_images): # no need to copy diagonal
        matchscores[j,i] = matchscores[i,j]

We store the number of matching features between each pair in matchscores. The last
part of copying the values to fill the matrix completely is not necessary since this
“distance measure” is symmetric; it just looks better that way. The matchscores matrix
for these particular images looks like this:



662 0 0 2 0 0 0 0 1 0 0 1 2 0 3 0 19 1 0 2
0 901 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 2
0 0 266 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
2 1 0 1481 0 0 2 2 0 0 0 2 2 0 0 0 2 3 2 0
0 0 0 0 1748 0 0 1 0 0 0 0 0 2 0 0 0 0 0 1
0 0 0 0 0 1747 0 0 1 0 0 0 0 0 0 0 0 1 1 0
0 0 0 2 0 0 555 0 0 0 1 4 4 0 2 0 0 5 1 0
0 1 0 2 1 0 0 2206 0 0 0 1 0 0 1 0 2 0 1 1
1 1 0 0 0 1 0 0 629 0 0 0 0 0 0 0 1 0 0 20
0 0 0 0 0 0 0 0 0 829 0 0 1 0 0 0 0 0 0 2
0 0 0 0 0 0 1 0 0 0 1025 0 0 0 0 0 1 1 1 0
1 1 0 2 0 0 4 1 0 0 0 528 5 2 15 0 3 6 0 0
2 0 0 2 0 0 4 0 0 1 0 5 736 1 4 0 3 37 1 0
0 0 1 0 2 0 0 0 0 0 0 2 1 620 1 0 0 1 0 0
3 0 0 0 0 0 2 1 0 0 0 15 4 1 553 0 6 9 1 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2273 0 1 0 0
19 0 0 2 0 0 0 2 1 0 1 3 3 0 6 0 542 0 0 0
1 0 0 3 0 1 5 0 0 0 1 6 37 1 9 1 0 527 3 0
0 1 0 2 0 1 1 1 0 0 1 0 1 0 1 0 0 3 1139 0
2 2 0 0 1 0 0 1 20 2 0 0 0 0 0 0 0 0 0 499

Figure 2-8. Images taken at the same geographic location (square region centered around the White
House) downloaded from panoramio.com.

Using this as a simple distance measure between images (images with similar content
have a higher number of matching features), we can now connect images with similar
visual content.
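
One way to turn such a matrix into groups is to link two images whenever their number of matches exceeds a threshold and then collect the connected components. The sketch below (Python 3 with NumPy) uses a small made-up 5 x 5 matrix; the function name connected_groups and the threshold value are assumptions for illustration.

```python
import numpy as np

def connected_groups(matchscores, threshold=2):
    """Group image indices that are linked by more than `threshold` matches."""
    n = matchscores.shape[0]
    # accept a link if either triangle of the matrix holds a large enough count
    linked = (matchscores > threshold) | (matchscores.T > threshold)
    unvisited = set(range(n))
    groups = []
    while unvisited:
        stack = [unvisited.pop()]
        group = []
        while stack:                      # depth-first walk over linked images
            i = stack.pop()
            group.append(i)
            for j in np.flatnonzero(linked[i]):
                j = int(j)
                if j in unvisited:
                    unvisited.remove(j)
                    stack.append(j)
        groups.append(sorted(group))
    return sorted(groups)

# hypothetical upper-triangular match counts for five images
toy_scores = np.array([[50, 8, 0, 0, 0],
                       [0, 60, 0, 5, 0],
                       [0, 0, 70, 0, 3],
                       [0, 0, 0, 40, 0],
                       [0, 0, 0, 0, 30]])
groups = connected_groups(toy_scores)
print(groups)
```

Images 0 and 3 end up in the same group without matching each other directly, because both are linked to image 1.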

Visualizing Connected Images 

Let's visualize the connections between images defined by them having matching local
descriptors. To do this, we can show the images in a graph with edges indicating
connections. We will use the pydot package (http://code.google.com/p/pydot/), which is
a Python interface to the powerful GraphViz graphing library. Pydot uses Pyparsing
(http://pyparsing.wikispaces.com/) and GraphViz (http://www.graphviz.org/), but don’t
worry; all of them are easy to install in just a few minutes.

Pydot is very easy to use. The following code snippet illustrates this nicely by creating
a graph illustrating a tree with depth two and branching factor five adding numbering
to the nodes. The graph is shown in Figure 2-9. There are many ways to customize
the graph layout and appearance. For more details, see the Pydot documentation or
the description of the DOT language used by GraphViz at
http://www.graphviz.org/Documentation.php.

import pydot

g = pydot.Dot(graph_type='graph')

g.add_node(pydot.Node(str(0),fontcolor='transparent'))
for i in range(5):
    g.add_node(pydot.Node(str(i+1)))
    g.add_edge(pydot.Edge(str(0),str(i+1)))
    for j in range(5):
        g.add_node(pydot.Node(str(j+1)+'-'+str(i+1)))
        g.add_edge(pydot.Edge(str(j+1)+'-'+str(i+1),str(j+1)))
g.write_png('graph.jpg',prog='neato')
Figure 2-9. An example of using pydot to create graphs.
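
If you are curious about what such a graph looks like in the DOT language, the same kind of tree can be written out as plain text without pydot. This is a small sketch in plain Python; the function name tree_to_dot and the graph name "tree" are assumptions, and the node labels follow a parent-child naming that is equivalent to, but not identical with, the pydot example.

```python
def tree_to_dot(branching=5):
    """Return DOT source for a root with `branching` children,
    each of which has `branching` children of its own."""
    lines = ['graph tree {']
    for i in range(1, branching + 1):
        lines.append('    "0" -- "%d";' % i)              # root to child
        for j in range(1, branching + 1):
            lines.append('    "%d" -- "%d-%d";' % (i, i, j))   # child to leaf
    lines.append('}')
    return '\n'.join(lines)

dot_source = tree_to_dot()
print(dot_source.splitlines()[1])
```

Saved to a file such as tree.dot, the text can be rendered with the GraphViz command line tools, for example with neato -Tpng tree.dot -o tree.png.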

Let's get back to our example with the geotagged images. To create a graph showing
potential groups of images, we create an edge between nodes if the number of matches
is above a threshold. To get the images in the graph, you need to use the full path of
each image (represented by the variable path in the example below). To make it look
nice, we also scale each image to a thumbnail with largest side 100 pixels. Here's how
to do it:

import pydot

threshold = 2 # min number of matches needed to create link

g = pydot.Dot(graph_type='graph') # don't want the default directed graph

for i in range(nbr_images):
    for j in range(i+1,nbr_images):
        if matchscores[i,j] > threshold:
            # first image in pair
            im = Image.open(imlist[i])
            im.thumbnail((100,100))
            filename = str(i)+'.png'
            im.save(filename) # need temporary files of the right size
            g.add_node(pydot.Node(str(i),fontcolor='transparent',
                       shape='rectangle',image=path+filename))

            # second image in pair
            im = Image.open(imlist[j])
            im.thumbnail((100,100))
            filename = str(j)+'.png'
            im.save(filename) # need temporary files of the right size
            g.add_node(pydot.Node(str(j),fontcolor='transparent',
                       shape='rectangle',image=path+filename))

            g.add_edge(pydot.Edge(str(i),str(j)))

g.write_png('whitehouse.png')
The result should look something like Figure 2-10, depending on which images you
download. For this particular set, we see two groups of images, one from each side of
the White House.

This application was a very simple example of using local descriptors for matching
regions between images. For example, we did not use any verification on the matches.
This can be done (in a very robust way) using concepts that we will define in the coming
two chapters.

Figure 2-10. An example of grouping images taken at the same geographic location using local
descriptors.


Exercises 

1. Modify the function for matching Harris corner points to also take a maximum
pixel distance between points for them to be considered as correspondences, in
order to make matching more robust.

2. Incrementally apply stronger blur (or ROF de-noising) to an image and extract
Harris corners. What happens?

3. An alternative corner detector to Harris is the FAST corner detector. There are a
number of implementations, including a pure Python version available at
http://www.edwardrosten.com/work/fast.html. Try this detector, play with the sensitivity
threshold, and compare the corners with the ones from our Harris implementation.

4. Create copies of an image with different resolutions (for example, by halving the
size a few times). Extract SIFT features for each image. Plot and match features to
get a feel for how and when the scale independence breaks down.

5. The VLFeat command line tools also contain an implementation of Maximally
Stable Extremal Regions (MSER)
(http://en.wikipedia.org/wiki/Maximally_stable_extremal_regions), a region detector
that finds blob-like regions. Create a function
for extracting MSER regions and pass them to the descriptor part of SIFT using the
--read-frames option and one function for plotting the ellipse regions.

6. Write a function that matches features between a pair of images and estimates the
scale difference and in-plane rotation of the scene, based on the correspondences.

7. Download images for a location of your choice and match them as in the White
House example. Can you find a better criterion for linking images? How could you
use the graph to choose representative images for geographic locations?
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CHAPTER3 


Image to Image Mappings 


This chapter describes transformations between images and some practical methods for 
computing them. These transformations are used for warping and image registration. 
Finally, we look at an example of automatically creating panoramas. 

3.1 Homographies 

A homography is a 2D projective transformation that maps points in one plane to
another. In our case, the planes are images or planar surfaces in 3D. Homographies have
many practical uses, such as registering images, rectifying images, texture warping, and
creating panoramas. We will make frequent use of them. In essence, a homography H
maps 2D points (in homogeneous coordinates) according to

\[
\begin{bmatrix} x' \\ y' \\ w' \end{bmatrix} =
\begin{bmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & h_9 \end{bmatrix}
\begin{bmatrix} x \\ y \\ w \end{bmatrix}
\quad \text{or} \quad \mathbf{x}' = H\mathbf{x}.
\]

Homogeneous coordinates are a useful representation for points in image planes (and
in 3D, as we will see later). Points in homogeneous coordinates are only defined up
to scale so that x = [x, y, w] = [αx, αy, αw] = [x/w, y/w, 1] all refer to the same 2D
point. As a consequence, the homography H is also only defined up to scale and has
eight independent degrees of freedom. Often points are normalized with w = 1 to have
a unique identification of the image coordinates x, y. The extra coordinate makes it
easy to represent transformations with a single matrix.

Create a file homography.py and add the following functions to normalize and convert 
to homogeneous coordinates: 

from numpy import *

def normalize(points):
    """ Normalize a collection of points in
        homogeneous coordinates so that last row = 1. """

    for row in points:
        row /= points[-1]
    return points


def make_homog(points):
    """ Convert a set of points (dim*n array) to
        homogeneous coordinates. """

    return vstack((points,ones((1,points.shape[1]))))


When working with points and transformations, we will store the points column-wise
so that a set of n points in two dimensions will be a 3 × n array in homogeneous
coordinates. This format makes matrix multiplications and point transforms easier.
For all other cases, we will typically use rows to store data, for example features for
clustering and classification.
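
As a small illustration of the column-wise layout, the following sketch stores three points as a 3 × n array and transforms them all with one matrix product. The matrix H here is a made-up translation chosen only for the example, and the sketch uses the `np.` prefix rather than the star import so that it runs on its own.

```python
import numpy as np

# Three 2D points stored column-wise: row 0 holds x, row 1 holds y.
points = np.array([[0.0, 1.0, 2.0],
                   [0.0, 3.0, 5.0]])

# Append a row of ones to get a 3 x n array in homogeneous coordinates.
homog = np.vstack((points, np.ones((1, points.shape[1]))))

# Illustrative homography (an assumption for this example):
# a pure translation by (10, 20).
H = np.array([[1.0, 0.0, 10.0],
              [0.0, 1.0, 20.0],
              [0.0, 0.0, 1.0]])

# One matrix product transforms all points at once.
moved = np.dot(H, homog)

# H is only defined up to scale: multiplying it by a constant gives the
# same 2D points once we divide by the last row.
scaled = np.dot(5.0 * H, homog)
scaled_normalized = scaled / scaled[-1]
```

After normalization, `scaled_normalized` and `moved` describe the same image points, which is the "defined up to scale" property in practice.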

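There are some important special cases of these projective transformations. An affine
transformation

\[
\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} =
\begin{bmatrix} a_1 & a_2 & t_x \\ a_3 & a_4 & t_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
\quad \text{or} \quad
\mathbf{x}' = \begin{bmatrix} A & \mathbf{t} \\ \mathbf{0} & 1 \end{bmatrix} \mathbf{x}
\]

preserves w = 1 and cannot represent as strong deformations as a full projective
transformation. The affine transformation contains an invertible matrix A and a translation
vector t = [t_x, t_y]. Affine transformations are used, for example, in warping.

A similarity transformation

\[
\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} =
\begin{bmatrix} s\cos\theta & -s\sin\theta & t_x \\ s\sin\theta & s\cos\theta & t_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
\quad \text{or} \quad
\mathbf{x}' = \begin{bmatrix} sR & \mathbf{t} \\ \mathbf{0} & 1 \end{bmatrix} \mathbf{x}
\]

is a rigid 2D transformation that also includes scale changes. The scalar s specifies
scaling, R is a rotation of an angle θ, and t = [t_x, t_y] is again a translation. With s = 1
distances are preserved and it is then a rigid transformation. Similarity transformations
are used, for example, in image registration.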
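
To make the role of the scale s concrete, here is a minimal sketch that builds a similarity matrix and checks what happens to the distance between two points. The helper name `similarity_matrix` and all numeric values are assumptions made for this illustration.

```python
import numpy as np

def similarity_matrix(s, theta, tx, ty):
    """Hypothetical helper: 3x3 similarity transform with scale s,
    rotation angle theta (radians) and translation (tx, ty)."""
    c, si = np.cos(theta), np.sin(theta)
    return np.array([[s * c, -s * si, tx],
                     [s * si, s * c, ty],
                     [0.0, 0.0, 1.0]])

def distance(p):
    """Euclidean distance between the two points stored as columns of p."""
    return np.hypot(p[0, 0] - p[0, 1], p[1, 0] - p[1, 1])

# Two points in homogeneous coordinates, stored column-wise.
pts = np.array([[0.0, 3.0],
                [0.0, 4.0],
                [1.0, 1.0]])

rigid = np.dot(similarity_matrix(1.0, np.pi / 6, 5.0, -2.0), pts)
scaled = np.dot(similarity_matrix(2.0, np.pi / 6, 5.0, -2.0), pts)

d0 = distance(pts)          # distance before the transformation
d_rigid = distance(rigid)   # unchanged when s = 1
d_scaled = distance(scaled) # multiplied by s otherwise
```

The last row of both results stays at 1, which reflects that these special cases preserve w = 1.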

Let's look at algorithms for estimating homographies and then go into examples of
using affine transformations for warping, similarity transformations for registration,
and finally full projective transformations for creating panoramas.


The Direct Linear Transformation Algorithm 

Homographies can be computed directly from corresponding points in two images
(or planes). As mentioned earlier, a full projective transformation has eight degrees of
freedom. Each point correspondence gives two equations, one each for the x and y
coordinates, and therefore four point correspondences are needed to compute H.

The direct linear transformation (DLT) is an algorithm for computing H given four
or more correspondences. By rewriting the equation for mapping points using H for
several correspondences, we get an equation like

\[
\begin{bmatrix}
-x_1 & -y_1 & -1 & 0 & 0 & 0 & x_1 x_1' & y_1 x_1' & x_1' \\
0 & 0 & 0 & -x_1 & -y_1 & -1 & x_1 y_1' & y_1 y_1' & y_1' \\
-x_2 & -y_2 & -1 & 0 & 0 & 0 & x_2 x_2' & y_2 x_2' & x_2' \\
0 & 0 & 0 & -x_2 & -y_2 & -1 & x_2 y_2' & y_2 y_2' & y_2' \\
 & & & & \vdots & & & &
\end{bmatrix}
\begin{bmatrix} h_1 \\ h_2 \\ \vdots \\ h_9 \end{bmatrix} = \mathbf{0},
\]

or Ah = 0 where A is a matrix with twice as many rows as correspondences. By stacking
all corresponding points, a least squares solution for H can be found using singular
value decomposition (SVD). Here's what it looks like in code. Add the function below
to homography.py.

def H_from_points(fp,tp):
    """ Find homography H, such that fp is mapped to tp
        using the linear DLT method. Points are conditioned
        automatically. """

    if fp.shape != tp.shape:
        raise RuntimeError('number of points do not match')

    # condition points (important for numerical reasons)
    # --from points--
    m = mean(fp[:2], axis=1)
    maxstd = max(std(fp[:2], axis=1)) + 1e-9
    C1 = diag([1/maxstd, 1/maxstd, 1])
    C1[0][2] = -m[0]/maxstd
    C1[1][2] = -m[1]/maxstd
    fp = dot(C1,fp)

    # --to points--
    m = mean(tp[:2], axis=1)
    maxstd = max(std(tp[:2], axis=1)) + 1e-9
    C2 = diag([1/maxstd, 1/maxstd, 1])
    C2[0][2] = -m[0]/maxstd
    C2[1][2] = -m[1]/maxstd
    tp = dot(C2,tp)

    # create matrix for linear method, 2 rows for each correspondence pair
    nbr_correspondences = fp.shape[1]
    A = zeros((2*nbr_correspondences,9))
    for i in range(nbr_correspondences):
        A[2*i] = [-fp[0][i],-fp[1][i],-1,0,0,0,
                    tp[0][i]*fp[0][i],tp[0][i]*fp[1][i],tp[0][i]]
        A[2*i+1] = [0,0,0,-fp[0][i],-fp[1][i],-1,
                    tp[1][i]*fp[0][i],tp[1][i]*fp[1][i],tp[1][i]]

    U,S,V = linalg.svd(A)
    H = V[8].reshape((3,3))

    # decondition
    H = dot(linalg.inv(C2),dot(H,C1))

    # normalize and return
    return H / H[2,2]

The first thing that happens in this function is a check that the two point arrays have
the same shape. If not, an exception is thrown. This is useful for writing robust code, but
we will only use exceptions in very few cases in this book to make the code samples simpler
and easier to follow. You can read more about exception types at
http://docs.python.org/library/exceptions.html and how to use them at
http://docs.python.org/tutorial/errors.html.

The points are conditioned by normalizing so that they have zero mean and unit
standard deviation. This is very important for numerical reasons, since the stability
of the algorithm is dependent on the coordinate representation. Then the matrix A is
created using the point correspondences. The least squares solution is found as the last
row of the matrix V of the SVD. The row is reshaped to create H. This matrix is then
de-conditioned and normalized before being returned.
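
The core of the DLT can be checked on synthetic data. The sketch below is a stripped-down variant written for this illustration (the name `dlt_homography` is hypothetical, and the conditioning step is deliberately left out to keep it short). It builds A from four exact correspondences generated with a made-up homography and recovers that homography from the SVD.

```python
import numpy as np

def dlt_homography(fp, tp):
    """Minimal DLT sketch without conditioning. fp and tp are 3 x n
    arrays of normalized homogeneous points (last row = 1), n >= 4."""
    n = fp.shape[1]
    A = np.zeros((2 * n, 9))
    for i in range(n):
        x, y = fp[0, i], fp[1, i]
        xp, yp = tp[0, i], tp[1, i]
        A[2 * i] = [-x, -y, -1, 0, 0, 0, xp * x, xp * y, xp]
        A[2 * i + 1] = [0, 0, 0, -x, -y, -1, yp * x, yp * y, yp]
    # The right singular vector belonging to the smallest singular
    # value minimizes ||Ah|| subject to ||h|| = 1.
    U, S, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape((3, 3))
    return H / H[2, 2]

# Made-up ground truth used only to generate test correspondences.
H_true = np.array([[1.2, 0.1, 5.0],
                   [-0.2, 0.9, 3.0],
                   [0.001, 0.002, 1.0]])

fp = np.array([[0.0, 100.0, 100.0, 0.0],
               [0.0, 0.0, 100.0, 100.0],
               [1.0, 1.0, 1.0, 1.0]])
tp = np.dot(H_true, fp)
tp = tp / tp[2]          # normalize so that the last row is 1

H_est = dlt_homography(fp, tp)
```

With exact correspondences, `H_est` agrees with `H_true` up to rounding. With noisy or badly scaled coordinates the conditioning step becomes important, which is why it should not be dropped in real use.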


Affine Transformations 

An affine transformation has six degrees of freedom and therefore three point
correspondences are needed to estimate H. Affine transforms can be estimated using the
DLT algorithm above by setting the last two elements equal to zero, h7 = h8 = 0.

Here we will use a different approach, described in detail in [13] (page 130). Add the
following function to homography.py, which computes the affine transformation matrix
from point correspondences:

def Haffine_from_points(fp,tp):
    """ Find H, affine transformation, such that
        tp is affine transf of fp. """

    if fp.shape != tp.shape:
        raise RuntimeError('number of points do not match')

    # condition points
    # --from points--
    m = mean(fp[:2], axis=1)
    maxstd = max(std(fp[:2], axis=1)) + 1e-9
    C1 = diag([1/maxstd, 1/maxstd, 1])
    C1[0][2] = -m[0]/maxstd
    C1[1][2] = -m[1]/maxstd
    fp_cond = dot(C1,fp)

    # --to points--
    m = mean(tp[:2], axis=1)
    C2 = C1.copy() # must use same scaling for both point sets
    C2[0][2] = -m[0]/maxstd
    C2[1][2] = -m[1]/maxstd
    tp_cond = dot(C2,tp)

    # conditioned points have mean zero, so translation is zero
    A = concatenate((fp_cond[:2],tp_cond[:2]), axis=0)
    U,S,V = linalg.svd(A.T)

    # create B and C matrices as Hartley-Zisserman (2:nd ed) p 130.
    tmp = V[:2].T
    B = tmp[:2]
    C = tmp[2:4]

    tmp2 = concatenate((dot(C,linalg.pinv(B)),zeros((2,1))), axis=1)
    H = vstack((tmp2,[0,0,1]))

    # decondition
    H = dot(linalg.inv(C2),dot(H,C1))

    return H / H[2,2]

Again, the points are conditioned and de-conditioned as in the DLT algorithm. Let's
see what these affine transformations can do with images in the next section.
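
As a cross-check on the six degrees of freedom, an affine transform can also be estimated with an ordinary least squares solve. This is a different, simpler technique than the one from [13] used above, and the function name `affine_lstsq` and the test values are assumptions made for this sketch.

```python
import numpy as np

def affine_lstsq(fp, tp):
    """Illustrative alternative (not the method of [13]): least squares
    affine fit. fp and tp are 3 x n arrays of normalized homogeneous
    points with n >= 3 and the points not all on one line."""
    # tp[:2] = H[:2] * fp, so fp.T * H[:2].T = tp[:2].T is a linear
    # system in the six unknown entries of H[:2].
    M, residuals, rank, sv = np.linalg.lstsq(fp.T, tp[:2].T, rcond=None)
    return np.vstack((M.T, [0.0, 0.0, 1.0]))

# Made-up affine transform used to generate three exact correspondences.
H_true = np.array([[2.0, 0.5, 10.0],
                   [-0.3, 1.5, -4.0],
                   [0.0, 0.0, 1.0]])
fp = np.array([[0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0],
               [1.0, 1.0, 1.0]])
tp = np.dot(H_true, fp)

H_est = affine_lstsq(fp, tp)
```

Three non-collinear correspondences give six equations for six unknowns, so the fit is exact. With more points the same call returns the least squares compromise.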

3.2 Warping Images 

Applying an affine transformation matrix H on image patches is called warping (or
affine warping) and is frequently used in computer graphics but also in several computer
vision algorithms. A warp can easily be performed with SciPy using the ndimage package.
The command

transformed_im = ndimage.affine_transform(im,A,b,size)

transforms the image patch im with A a linear transformation and b a translation vector
as above. The optional argument size can be used to specify the size of the output image.
The default is an image with the same size as the original. To see how this works, try
running the following commands:

from scipy import ndimage

im = array(Image.open('empire.jpg').convert('L'))
H = array([[1.4,0.05,-100],[0.05,1.5,-100],[0,0,1]])
im2 = ndimage.affine_transform(im,H[:2,:2],(H[0,2],H[1,2]))

figure()
gray()
imshow(im2)
show()

This gives a result like the image to the right in Figure 3-1. As you can see, missing pixel
values in the result image are filled with zeros.
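
One detail that is easy to miss is the direction of the mapping. ndimage.affine_transform() computes each output pixel by looking up the input image at the position given by the matrix applied to the output coordinates plus the offset, so the transform passed to it should map target coordinates back to source coordinates. The following pure NumPy sketch imitates that convention with nearest-neighbor lookup. It is only an illustration (the function name is hypothetical and there is no interpolation), not a replacement for the SciPy routine.

```python
import numpy as np

def inverse_warp_nearest(im, A, b):
    """Sketch of the pull-style (inverse) mapping convention: output
    pixel o takes its value from input position A.dot(o) + b, rounded
    to the nearest pixel. Positions outside the input become zero."""
    out = np.zeros_like(im)
    rows, cols = im.shape
    for r in range(rows):
        for c in range(cols):
            src = np.dot(A, [r, c]) + b
            sr, sc = int(round(src[0])), int(round(src[1]))
            if 0 <= sr < rows and 0 <= sc < cols:
                out[r, c] = im[sr, sc]
    return out

im = np.arange(16).reshape((4, 4))

# Identity matrix with offset (1, 0): output row r reads input row r + 1,
# so the image content appears shifted up by one row and the last row,
# which has no source pixels, is filled with zeros.
shifted = inverse_warp_nearest(im, np.eye(2), np.array([1.0, 0.0]))
```

Note that a positive offset moves the content in the negative direction, which is the opposite of what a forward mapping would do.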


Image in Image 

A simple example of affine warping is to place images, or parts of images, inside another
image so that they line up with specific areas or landmarks.

Figure 3-1. An example of warping an image using an affine transform: original (left), image after
warping with ndimage.affine_transform() (right).


Add the function image_in_image() to warp.py. This function takes two images and the 
corner coordinates of where to put the first image in the second: 

# imports used by the functions in warp.py
from numpy import *
from scipy import ndimage
import homography

def image_in_image(im1,im2,tp):
    """ Put im1 in im2 with an affine transformation
        such that corners are as close to tp as possible.
        tp are homogeneous and counterclockwise from top left. """

    # points to warp from
    m,n = im1.shape[:2]
    fp = array([[0,m,m,0],[0,0,n,n],[1,1,1,1]])

    # compute affine transform and apply
    H = homography.Haffine_from_points(tp,fp)
    im1_t = ndimage.affine_transform(im1,H[:2,:2],
                    (H[0,2],H[1,2]),im2.shape[:2])
    alpha = (im1_t > 0)

    return (1-alpha)*im2 + alpha*im1_t

As you can see, there is not much needed to do this. When blending together the warped
image and the second image, we create an alpha map that defines how much of each
pixel to take from each image. Here we use the fact that the warped image is filled
with zeros outside the borders of the warped area to create a binary alpha map. To be
really strict, we could have added a small number to the potential zero pixels of the first
image, or done it properly (see exercises at the end of the chapter). Note that the image
coordinates are in homogeneous form.
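
The blending step, and the caveat about genuine zero pixels, can be seen on a tiny synthetic example. The arrays below are made up for the illustration.

```python
import numpy as np

im2 = np.full((3, 4), 200.0)      # background image
im1_t = np.zeros((3, 4))          # warped image, zero outside its area
im1_t[1, 1:3] = [50.0, 0.0]       # pixel (1, 2) is a genuine black pixel

# Binary alpha map: True wherever the warped image is non-zero.
alpha = (im1_t > 0)

# Take the warped image where alpha is set, the background elsewhere.
blended = (1 - alpha) * im2 + alpha * im1_t
```

Pixel (1, 1) is taken from the warped image, but the black pixel at (1, 2) is treated as "outside" and replaced by the background. This is the inaccuracy that adding a small number to zero pixels would avoid.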

Figure 3-2. An example of placing an image inside another image using an affine transformation.

To try this function, let's insert an image on a billboard in another image. The following
lines of code will put the leftmost image of Figure 3-2 into the second image. The
coordinates were determined manually by looking at a plot of the image (in PyLab
figures, the mouse coordinates are shown near the bottom). PyLab's ginput() could,
of course, also have been used.

import warp

# example of affine warp of im1 onto im2
im1 = array(Image.open('beatles.jpg').convert('L'))
im2 = array(Image.open('billboard_for_rent.jpg').convert('L'))

# set to points
tp = array([[264,538,540,264],[40,36,605,605],[1,1,1,1]])

im3 = warp.image_in_image(im1,im2,tp)

figure()
gray()
imshow(im3)
axis('equal')
axis('off')
show()

This puts the image on the upper part of the billboard. Note again that the landmark 
coordinates tp are in homogeneous coordinates. Changing the coordinates to 

tp = array([[675,826,826,677],[55,52,281,277],[1,1,1,1]])

will put the image on the lower-left “for rent” part. 

The function Haffine_from_points() gives the best affine transform for the given point
correspondences. In the example above, those were the image corners and the corners of
the billboard. If the perspective effects are small, this will give good results. The top row
of Figure 3-3 shows what happens if we try to use an affine transformation to a billboard
image with more perspective. It is not possible to transform all four corner points to
their target locations with the same affine transform (a full projective transform would
have been able to do this, though). If you want to use an affine warp so that all corner
points match, there is a useful trick.

Figure 3-3. Comparing an affine warp of the full image with an affine warp using two triangles. The
image is placed on a billboard with some perspective effects. Using an affine transform for the whole
image results in a bad fit. The two right-hand corners are enlarged for clarity (top). Using an affine
warp consisting of two triangles gives an exact fit (bottom).


For three points, an affine transform can warp an image so that the three
correspondences match perfectly. This is because an affine transform has six degrees of freedom
and three correspondences give exactly six constraints (x and y coordinates must match
for all three). So if you really want the image to fit the billboard using affine transforms,
you can divide the image into two triangles and warp them separately. Here's how to
do it:

# set from points to corners of im1
m,n = im1.shape[:2]
fp = array([[0,m,m,0],[0,0,n,n],[1,1,1,1]])

# first triangle
tp2 = tp[:,:3]
fp2 = fp[:,:3]

# compute H
H = homography.Haffine_from_points(tp2,fp2)
im1_t = ndimage.affine_transform(im1,H[:2,:2],
                    (H[0,2],H[1,2]),im2.shape[:2])

# alpha for triangle
alpha = warp.alpha_for_triangle(tp2,im2.shape[0],im2.shape[1])
im3 = (1-alpha)*im2 + alpha*im1_t

# second triangle
tp2 = tp[:,[0,2,3]]
fp2 = fp[:,[0,2,3]]

# compute H
H = homography.Haffine_from_points(tp2,fp2)
im1_t = ndimage.affine_transform(im1,H[:2,:2],
                    (H[0,2],H[1,2]),im2.shape[:2])

# alpha for triangle
alpha = warp.alpha_for_triangle(tp2,im2.shape[0],im2.shape[1])
im4 = (1-alpha)*im3 + alpha*im1_t

figure()
gray()
imshow(im4)
axis('equal')
axis('off')
show()

Here we simply create the alpha map for each triangle and then merge all images
together. The alpha map for a triangle can be computed simply by checking if a pixel's
coordinates can be written as a convex combination of the triangle's corner points.^ If the
coordinates can be expressed this way, that means the pixel is inside the triangle. Add
the following function alpha_for_triangle(), which was used in the example above, to
warp.py.

def alpha_for_triangle(points,m,n):
    """ Creates alpha map of size (m,n)
        for a triangle with corners defined by points
        (given in normalized homogeneous coordinates). """

    alpha = zeros((m,n))
    for i in range(min(points[0]),max(points[0])):
        for j in range(min(points[1]),max(points[1])):
            x = linalg.solve(points,[i,j,1])
            if min(x) > 0: #all coefficients positive
                alpha[i,j] = 1
    return alpha

This is an operation your graphics card can do extremely fast. Python is a lot slower
than your graphics card (or a C/C++ implementation for that matter) but it works just
fine for our purposes. As you can see at the bottom of Figure 3-3, the corners now
match.
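
If the double loop becomes a bottleneck, the same convex combination test can be done for all pixels in one linear solve. The function below is a hypothetical vectorized variant written for this illustration. It tests every pixel of the (m,n) image rather than only the bounding box of the triangle, and it uses the `np.` prefix so that it runs on its own.

```python
import numpy as np

def alpha_for_triangle_vectorized(points, m, n):
    """Hypothetical vectorized variant. points is a 3 x 3 array whose
    columns are the triangle corners in normalized homogeneous
    coordinates. Returns an (m, n) array that is 1 inside the triangle."""
    rows, cols = np.mgrid[0:m, 0:n]
    pixels = np.vstack((rows.ravel(), cols.ravel(), np.ones(m * n)))
    # Each column of coeffs holds the three weights of one pixel. The
    # homogeneous row of ones forces the weights to sum to 1.
    coeffs = np.linalg.solve(points, pixels)
    inside = coeffs.min(axis=0) > 0      # all coefficients positive
    return inside.reshape((m, n)).astype(float)

# Triangle with corners (0,0), (8,0) and (0,8), one corner per column.
corners = np.array([[0.0, 8.0, 0.0],
                    [0.0, 0.0, 8.0],
                    [1.0, 1.0, 1.0]])
alpha = alpha_for_triangle_vectorized(corners, 10, 10)
```

For example, the pixel (2, 2) has weights 0.5, 0.25, and 0.25 and is inside, while (5, 5) needs a negative weight for the first corner and is outside.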


Piecewise Affine Warping 

As we saw in the example above, affine warping of triangle patches can be done to
exactly match the corner points. Let's look at the most common form of warping
between a set of corresponding points, piecewise affine warping. Given any image with
landmark points, we can warp that image to corresponding landmarks in another image
by triangulating the points into a triangle mesh and then warping each triangle with an
affine transform. These are standard operations for any graphics and image processing
library. Here we show how to do this using Matplotlib and SciPy.

^ A convex combination is a linear combination $\sum_j \alpha_j \mathbf{x}_j$ (in this case of the triangle points) such that all
coefficients $\alpha_j$ are non-negative and sum to 1.

Figure 3-4. An example of Delaunay triangulation of a set of random 2D points.

To triangulate points, Delaunay triangulation is often used. An implementation of
Delaunay triangulation comes included in Matplotlib (but outside the PyLab part) and 
can be used like this: 

import matplotlib.delaunay as md

x,y = array(random.standard_normal((2,100)))
centers,edges,tri,neighbors = md.delaunay(x,y)

figure()
for t in tri:
    t_ext = [t[0], t[1], t[2], t[0]] # add first point to end
    plot(x[t_ext],y[t_ext],'r')

plot(x,y,'*')
axis('off')
show()

Figure 3-4 shows some example points and the resulting triangulation. Delaunay
triangulation chooses the triangles so that the minimum angle of all the angles of the
triangles in the triangulation is maximized.^ There are four outputs of delaunay(), of
which we only need the list of triangles (the third of the outputs). Create a function in
warp.py for the triangulation:

import matplotlib.delaunay as md

def triangulate_points(x,y):
    """ Delaunay triangulation of 2D points. """

    centers,edges,tri,neighbors = md.delaunay(x,y)
    return tri


^ The edges are actually the dual graph of a Voronoi diagram. See
http://en.wikipedia.org/wiki/Delaunay_triangulation.

The output is an array with each row containing the indices in the arrays x and y for
the three points of each triangle.

Let's now apply this to an example of warping an image to a non-flat object in another
image using 30 control points in a 5 × 6 grid. Figure 3-5b shows an image to be warped
to the facade of the “turning torso.” The target points were manually selected using
ginput() and stored in the file turningtorso1_points.txt.

First, we need a general warp function for piecewise affine image warping. The code 
below does the trick, where we also take the opportunity to show how to warp color 
images (you simply warp each color channel): 


def pw_affine(fromim,toim,fp,tp,tri):
    """ Warp triangular patches from an image.
        fromim = image to warp
        toim = destination image
        fp = from points in hom. coordinates
        tp = to points in hom. coordinates
        tri = triangulation. """

    im = toim.copy()

    # check if image is grayscale or color
    is_color = len(fromim.shape) == 3

    # create image to warp to (needed if iterate colors)
    im_t = zeros(im.shape, 'uint8')

    for t in tri:
        # compute affine transformation
        H = homography.Haffine_from_points(tp[:,t],fp[:,t])

        if is_color:
            for col in range(fromim.shape[2]):
                im_t[:,:,col] = ndimage.affine_transform(
                    fromim[:,:,col],H[:2,:2],(H[0,2],H[1,2]),im.shape[:2])
        else:
            im_t = ndimage.affine_transform(
                    fromim,H[:2,:2],(H[0,2],H[1,2]),im.shape[:2])

        # alpha for triangle
        alpha = alpha_for_triangle(tp[:,t],im.shape[0],im.shape[1])

        # add triangle to image
        im[alpha>0] = im_t[alpha>0]

    return im


Here we first check if the image is grayscale or color and in the case of colors, we warp
each color channel. The affine transform for each triangle is uniquely determined, so
we use Haffine_from_points(). Add this function to the file warp.py.

To use this function on the current example, the following short script puts it all
together:

import homography
import warp

# open image to warp
fromim = array(Image.open('sunset_tree.jpg'))
x,y = meshgrid(range(5),range(6))
x = (fromim.shape[1]/4) * x.flatten()
y = (fromim.shape[0]/5) * y.flatten()

# triangulate
tri = warp.triangulate_points(x,y)

# open image and destination points
im = array(Image.open('turningtorso1.jpg'))
tp = loadtxt('turningtorso1_points.txt') # destination points

# convert points to hom. coordinates
fp = vstack((y,x,ones((1,len(x)))))
tp = vstack((tp[:,1],tp[:,0],ones((1,len(tp)))))

# warp triangles
im = warp.pw_affine(fromim,im,fp,tp,tri)

# plot
figure()
imshow(im)
warp.plot_mesh(tp[1],tp[0],tri)
axis('off')
show()

The resulting image is shown in Figure 3-5c. The triangles are plotted with the following 
helper function (add this to warp.py ): 

def plot_mesh(x,y,tri):
    """ Plot triangles. """

    for t in tri:
        t_ext = [t[0], t[1], t[2], t[0]] # add first point to end
        plot(x[t_ext],y[t_ext],'r')

This example should give you all you need to apply piecewise affine warping of images
to your own applications. There are many improvements that can be made to the
functions used. Let's leave some to the exercises and the rest to you.


Registering Images 

Image registration is the process of transforming images so that they are aligned in a
common coordinate frame. Registration can be rigid or non-rigid and is an important
step in order to be able to do image comparisons and more sophisticated analysis.

Figure 3-5. An example of piecewise affine warping using Delaunay triangulated landmark points:
(a) the target image with landmarks; (b) image with triangulation; (c) with warped image; (d) with
warped image and triangulation.

Let's look at an example of rigidly registering a set of face images so that we can compute
the mean face and face appearance variations in a meaningful way. In this type of
registration we are actually looking for a similarity transform (rigid with scale) to map
correspondences. This is because the faces are not all at the same size, position, and
rotation in the images.

In the file jkfaces.zip are 366 images of a single person (one for each day in 2008).^ The
images are annotated with eye and mouth coordinates in the file jkfaces.xml. Using
the points, a similarity transformation can be computed and the images warped to a
normalized coordinate frame using this transformation (which, as mentioned, includes
scaling). To read XML files, we will use minidom that comes with Python's built-in
xml.dom module.

The XML file looks like this:

<?xml version="1.0" encoding="utf-8"?>
<faces>
<face file="jk-002.jpg" xf="46" xm="56" xs="67" yf="38" ym="65" ys="39"/>
<face file="jk-006.jpg" xf="38" xm="48" xs="59" yf="38" ym="65" ys="38"/>
<face file="jk-004.jpg" xf="40" xm="50" xs="61" yf="38" ym="66" ys="39"/>
<face file="jk-010.jpg" xf="33" xm="44" xs="55" yf="38" ym="65" ys="38"/>
...
</faces>

^ Images are courtesy of J. K. Keller (with permission). See http://jk-keller.com/daily-photo/ for more details.

To read the coordinates from the file, add the following function that uses minidom to a
new file imregistration.py.

from xml.dom import minidom

def read_points_from_xml(xmlFileName):
    """ Reads control points for face alignment. """

    xmldoc = minidom.parse(xmlFileName)
    facelist = xmldoc.getElementsByTagName('face')
    faces = {}
    for xmlFace in facelist:
        fileName = xmlFace.attributes['file'].value
        xf = int(xmlFace.attributes['xf'].value)
        yf = int(xmlFace.attributes['yf'].value)
        xs = int(xmlFace.attributes['xs'].value)
        ys = int(xmlFace.attributes['ys'].value)
        xm = int(xmlFace.attributes['xm'].value)
        ym = int(xmlFace.attributes['ym'].value)
        faces[fileName] = array([xf, yf, xs, ys, xm, ym])
    return faces


The landmark points are returned in a Python dictionary with the filename of the
image as key. The format is: xf,yf coordinates of the leftmost eye in the image (the
person's right), xs,ys coordinates of the rightmost eye, and xm,ym mouth coordinates.
To compute the parameters of the similarity transformation, we can use a least squares
solution. For each point $\mathbf{x}_i = [x_i, y_i]$ (in this case there are three of them), the point
should be mapped to the target location $[\hat{x}_i, \hat{y}_i]$ as

\[
\begin{bmatrix} \hat{x}_i \\ \hat{y}_i \end{bmatrix} =
\begin{bmatrix} a & -b \\ b & a \end{bmatrix}
\begin{bmatrix} x_i \\ y_i \end{bmatrix} +
\begin{bmatrix} t_x \\ t_y \end{bmatrix}.
\]

Taking all three points, we can rewrite this as a system of equations with the unknowns
a, b, t_x, t_y like this:

\[
\begin{bmatrix} \hat{x}_1 \\ \hat{y}_1 \\ \hat{x}_2 \\ \hat{y}_2 \\ \hat{x}_3 \\ \hat{y}_3 \end{bmatrix} =
\begin{bmatrix}
x_1 & -y_1 & 1 & 0 \\
y_1 & x_1 & 0 & 1 \\
x_2 & -y_2 & 1 & 0 \\
y_2 & x_2 & 0 & 1 \\
x_3 & -y_3 & 1 & 0 \\
y_3 & x_3 & 0 & 1
\end{bmatrix}
\begin{bmatrix} a \\ b \\ t_x \\ t_y \end{bmatrix}.
\]

Here we used the parameterization of similarity matrices

\[
\begin{bmatrix} a & -b \\ b & a \end{bmatrix} =
s \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} = sR,
\]

with scale $s = \sqrt{a^2 + b^2}$ and rotation matrix R.

More point correspondences would work the same way and only add extra rows to the
matrix. The least squares solution is found using linalg.lstsq(). This idea of using
least squares solutions is a standard trick that will be used many times in this book.
Actually this is the same as used in the DLT algorithm earlier.
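
To see the least squares idea in isolation, the following sketch generates three correspondences from a made-up similarity transform and recovers its parameters. All numeric values are assumptions for the illustration, and NumPy's own lstsq is used so that the snippet runs without SciPy.

```python
import numpy as np

# Hypothetical ground truth: scale 2, rotation 30 degrees, translation (3, -1).
s_true, theta_true = 2.0, np.deg2rad(30.0)
a_true = s_true * np.cos(theta_true)
b_true = s_true * np.sin(theta_true)
tx_true, ty_true = 3.0, -1.0

# Source points (rows are [x, y]) and their exact target locations.
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
dst = np.array([[a_true * x - b_true * y + tx_true,
                 b_true * x + a_true * y + ty_true] for x, y in src])

# Two rows per point, with the unknowns ordered as [a, b, tx, ty].
A = np.array([row for x, y in src
              for row in ([x, -y, 1.0, 0.0], [y, x, 0.0, 1.0])])
rhs = dst.ravel()        # [x1_hat, y1_hat, x2_hat, y2_hat, ...]

a, b, tx, ty = np.linalg.lstsq(A, rhs, rcond=None)[0]

# Scale and rotation angle follow from the parameterization a = s cos(theta),
# b = s sin(theta).
scale = np.hypot(a, b)
angle = np.arctan2(b, a)
```

Three points give six equations for four unknowns, so with real, noisy landmarks the result is the best compromise rather than an exact fit.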

The code looks like this (add to imregistration.py): 

from scipy import linalg

def compute_rigid_transform(refpoints,points):
    """ Computes rotation, scale and translation for
        aligning points to refpoints. """

    A = array([ [points[0], -points[1], 1, 0],
                [points[1],  points[0], 0, 1],
                [points[2], -points[3], 1, 0],
                [points[3],  points[2], 0, 1],
                [points[4], -points[5], 1, 0],
                [points[5],  points[4], 0, 1]])

    y = array([ refpoints[0],
                refpoints[1],
                refpoints[2],
                refpoints[3],
                refpoints[4],
                refpoints[5]])

    # least sq solution to minimize ||Ax - y||
    a,b,tx,ty = linalg.lstsq(A,y)[0]
    R = array([[a, -b], [b, a]]) # rotation matrix incl scale

    return R,tx,ty

The function returns a rotation matrix with scale as well as translation in the x
and y directions. To warp the images and store the new aligned images, we can apply
ndimage.affine_transform() to each color channel (these are color images). As
reference frame, any three point coordinates could be used. Here we will use the landmark
locations in the first image for simplicity:

from scipy import ndimage
from scipy.misc import imsave
import os

def rigid_alignment(faces,path,plotflag=False):
    """ Align images rigidly and save as new images.
        path determines where the aligned images are saved
        set plotflag=True to plot the images. """

    # take the points in the first image as reference points
    refpoints = list(faces.values())[0]

    # warp each image using affine transform
    for face in faces:
        points = faces[face]

        R,tx,ty = compute_rigid_transform(refpoints, points)
        T = array([[R[1][1], R[1][0]], [R[0][1], R[0][0]]])

        im = array(Image.open(os.path.join(path,face)))
        im2 = zeros(im.shape, 'uint8')

        # warp each color channel
        for i in range(im.shape[2]):
            im2[:,:,i] = ndimage.affine_transform(im[:,:,i],linalg.inv(T),offset=[-ty,-tx])

        if plotflag:
            imshow(im2)
            show()

        # crop away border and save aligned images
        h,w = im2.shape[:2]
        border = (w+h)//20
        imsave(os.path.join(path, 'aligned/'+face),im2[border:h-border,border:w-border,:])


Here we use the imsave() function to save the aligned images to a sub-directory 
“aligned”. 

The following short script will read the XML file containing filenames as keys and 
points as values and then register all the images to align them with the first one: 


import imregistration

# load the location of control points
xmlFileName = 'jkfaces2008_small/jkfaces.xml'
points = imregistration.read_points_from_xml(xmlFileName)

# register
imregistration.rigid_alignment(points,'jkfaces2008_small/')

If you run this, you should get aligned face images in a sub-directory. Figure 3-6 shows
six sample images before and after registration. The registered images are cropped
slightly to remove the undesired black fill pixels that may appear at the borders of
the images.

Now let's see how this affects the mean image. Figure 3-7 shows the mean image for
the unaligned face images next to the mean image of the aligned images (note the size
difference due to cropping the borders of the aligned images). Although the original
images show very little variation in size of the face, rotation and position, the effects on
the mean computation are drastic.

Not surprisingly, using badly registered images also has a drastic impact on the
computation of principal components. Figure 3-8 shows the result of PCA on the first 150
images from this set without and with registration. Just as with the mean image, the
PCA-modes are blurry. When computing the principal components, we used a mask
consisting of an ellipse centered around the mean face position. By multiplying the
images with this mask before stacking them, we can avoid bringing background variations
into the PCA-modes. Just replace the line that creates the matrix in the PCA example
in Section 1.3 (page 14) with

immatrix = array([mask*array(Image.open(imlist[i]).convert('L')).flatten()
                    for i in range(150)],'f')

Figure 3-6. Sample images before (top) and after rigid registration (bottom).

where mask is a binary image of the same size, already flattened.
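
The text does not show how the mask was made. One way to build such a flattened elliptical mask is sketched below, where the helper name, the image size, and the ellipse center and semi-axes are all placeholder assumptions. In practice the center would be the mean face position in your own images.

```python
import numpy as np

def ellipse_mask(shape, center, axes):
    """Hypothetical helper: flattened binary mask that is 1 inside an
    axis-aligned ellipse. center = (row, col) and axes = (semi-axis
    along rows, semi-axis along columns), all in pixels."""
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    inside = (((rows - center[0]) / float(axes[0])) ** 2 +
              ((cols - center[1]) / float(axes[1])) ** 2) <= 1.0
    return inside.astype('f').flatten()

# Placeholder values: a 100 x 80 image with the ellipse in the middle.
mask = ellipse_mask((100, 80), center=(50, 40), axes=(40, 30))
```

Because the mask is flattened in the same row-major order as the images, it can be multiplied directly with each flattened image.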



Figure 3-7. Comparing mean images: without alignment (left); with three-point rigid alignment (right).

Figure 3-8. Comparing PCA-modes of unregistered and registered images: the mean image and the
first nine principal components without registering the images beforehand (top); the same with the
registered images (bottom).


3.3 Creating Panoramas 

Two (or more) images that are taken at the same location (that is, the camera position is 
the same for the images) are homographically related (see Figure 3-9). This is frequently 
used for creating panoramic images where several images are stitched together into one 
big mosaic. In this section we will explore how this is done. 


RANSAC 

RANSAC, short for “RANdom SAmple Consensus,” is an iterative method to fit models
to data that can contain outliers. Given a model, for example a homography between
sets of points, the basic idea is that the data contains inliers, the data points that can
be described by the model, and outliers, those that do not fit the model.

Figure 3-9. Five images of the main university building in Lund, Sweden. The images are all taken
from the same viewpoint.

The standard example is the case of fitting a line to a set of points that contains
outliers. Simple least squares fitting will fail, but RANSAC can hopefully single out
the inliers and obtain the correct fit. Let's look at using ransac.py from
http://www.scipy.org/Cookbook/RANSAC, which contains this particular example as a test case.
Figure 3-10 shows an example of running ransac.test(). As you can see, the algorithm
selects only points consistent with a line model and correctly finds the right solution.

Figure 3-10. An example of using RANSAC to fit a line to points with outliers.

RANSAC is a very useful algorithm, which we will use in the next section for
homography estimation and again for other examples. For more information, see the original
paper by Fischler and Bolles [11], Wikipedia http://en.wikipedia.org/wiki/RANSAC, or
the report [40].
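
To make the iteration concrete, here is a minimal RANSAC sketch for the line-fitting case. It is written for this illustration and is not the code in ransac.py. The function name, the parameter names, the threshold, and the synthetic data are all assumptions.

```python
import numpy as np

def ransac_line(x, y, n_iter=200, threshold=0.5, seed=0):
    """Minimal RANSAC sketch for the line y = k*x + m (not ransac.py).
    Returns (k, m) refitted on the largest inlier set and the inlier mask."""
    rng = np.random.RandomState(seed)
    best_inliers = np.zeros(len(x), dtype=bool)
    for _ in range(n_iter):
        # 1. draw a minimal sample: two distinct points define a line
        i, j = rng.choice(len(x), 2, replace=False)
        if x[i] == x[j]:
            continue  # vertical line, cannot be written as y = k*x + m
        k = (y[j] - y[i]) / (x[j] - x[i])
        m = y[i] - k * x[i]
        # 2. find the points whose vertical residual is below the threshold
        inliers = np.abs(y - (k * x + m)) < threshold
        # 3. keep the model with the largest consensus set
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # 4. final least squares fit using only the inliers
    k, m = np.polyfit(x[best_inliers], y[best_inliers], 1)
    return (k, m), best_inliers

# Synthetic data: 30 points on y = 2x + 1 plus 10 gross outliers.
x = np.concatenate((np.linspace(0.0, 10.0, 30), np.linspace(0.0, 10.0, 10)))
y = np.concatenate((2.0 * x[:30] + 1.0, 40.0 + 3.0 * np.arange(10)))

(k, m), inliers = ransac_line(x, y)
```

A plain least squares fit to all 40 points would be pulled toward the outliers, while the consensus step here keeps only the 30 points on the line before the final fit.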

Robust Homography Estimation 

We can use this RANSAC module for any model. All that is needed is a Python class
with fit() and get_error() methods; the rest is taken care of by ransac.py. Here we are
interested in automatically finding a homography for the panorama images using a set
of possible correspondences. Figure 3-11 shows the matching correspondences found
automatically using SIFT features, by running the following commands:

import sift

featname = ['Univ'+str(i+1)+'.sift' for i in range(5)]
imname = ['Univ'+str(i+1)+'.jpg' for i in range(5)]
l = {}
d = {}
for i in range(5):
    sift.process_image(imname[i],featname[i])
    l[i],d[i] = sift.read_features_from_file(featname[i])

matches = {}
for i in range(4):
    matches[i] = sift.match(d[i+1],d[i])

It is clear from the images that not all correspondences are correct. SIFT is actually a
very robust descriptor and gives fewer false matches than, for example, Harris points
with patch correlation, but still it is far from perfect.

To fit a homography using RANSAC, we first need to add the following model class to
homography.py.

class RansacModel(object):
    """ Class for testing homography fit with ransac.py from
        http://www.scipy.org/Cookbook/RANSAC"""

    def __init__(self,debug=False):
        self.debug = debug

    def fit(self, data):
        """ Fit homography to four selected correspondences. """

        # transpose to fit H_from_points()
        data = data.T

        # from points
        fp = data[:3,:4]
        # target points
        tp = data[3:,:4]

        # fit homography and return
        return H_from_points(fp,tp)

    def get_error( self, data, H):
        """ Apply homography to all correspondences,
            return error for each transformed point. """

        data = data.T

        # from points
        fp = data[:3]
        # target points
        tp = data[3:]

        # transform fp
        fp_transformed = dot(H,fp)

        # normalize hom. coordinates
        for i in range(3):
            fp_transformed[i] /= fp_transformed[2]

        # return error per point
        return sqrt( sum((tp-fp_transformed)**2,axis=0) )

Figure 3-11. Matching correspondences found between consecutive image pairs using SIFT features. 

As you can see, this class contains a fit() method that just takes the four correspondences 
selected by ransac.py (they are the first four in data) and fits a homography. Remember, 
four points are the minimal number to compute a homography. The method 
get_error() applies the homography and returns the distance between each transformed 
point and its target point, so that RANSAC can choose which points to keep as inliers and 
which to reject as outliers. This is done with a threshold on this distance. For ease of use, 
add the following function to homography.py: 

def H_from_ransac(fp,tp,model,maxiter=1000,match_theshold=10):
    """ Robust estimation of homography H from point
        correspondences using RANSAC (ransac.py from
        http://www.scipy.org/Cookbook/RANSAC).

        input: fp,tp (3*n arrays) points in hom. coordinates. """

    import ransac

    # group corresponding points
    data = vstack((fp,tp))

    # compute H and return
    H,ransac_data = ransac.ransac(data.T,model,4,maxiter,match_theshold,10,
                                    return_all=True)
    return H,ransac_data['inliers']

The function also lets you supply the threshold and the minimum number of points 
desired. The most important parameter is the maximum number of iterations: exiting 
too early might give a worse solution; too many iterations will take more time. The 
resulting homography is returned together with the inlier points. 
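
A common rule of thumb relates the number of iterations to the fraction of correct 
matches. It is not part of ransac.py, and the function name and default values below are 
assumptions. If a fraction w of the correspondences are inliers and each sample draws s of 
them, a sample consists only of inliers with probability roughly w^s, so N samples all fail 
with probability (1 - w^s)^N. Requiring this to be at most 1 - p gives 
N >= log(1 - p) / log(1 - w^s): 

```python
import math

def ransac_iterations(inlier_ratio, sample_size=4, confidence=0.99):
    """Number of random samples needed so that, with probability
    `confidence`, at least one sample contains only inliers."""
    p_good_sample = inlier_ratio ** sample_size
    return int(math.ceil(math.log(1 - confidence) / math.log(1 - p_good_sample)))

n_half = ransac_iterations(0.5)    # half of the matches are correct
n_fifth = ransac_iterations(0.2)   # only one match in five is correct
```

With half of the matches correct, fewer than a hundred samples of four suffice at 99% 
confidence, while at one correct match in five the estimate exceeds the default 
maxiter=1000 used above. 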

Apply RANSAC to the correspondences like this: 

# function to convert the matches to hom. points
def convert_points(j):
    ndx = matches[j].nonzero()[0]
    fp = homography.make_homog(l[j+1][ndx,:2].T)
    ndx2 = [int(matches[j][i]) for i in ndx]
    tp = homography.make_homog(l[j][ndx2,:2].T)
    return fp,tp

# estimate the homographies
model = homography.RansacModel()

fp,tp = convert_points(1)
H_12 = homography.H_from_ransac(fp,tp,model)[0] #im 1 to 2

fp,tp = convert_points(0)
H_01 = homography.H_from_ransac(fp,tp,model)[0] #im 0 to 1

tp,fp = convert_points(2) #NB: reverse order
H_32 = homography.H_from_ransac(fp,tp,model)[0] #im 3 to 2

tp,fp = convert_points(3) #NB: reverse order
H_43 = homography.H_from_ransac(fp,tp,model)[0] #im 4 to 3

In this example, image number 2 is the central image and the one we want to warp the 
others to. Images 0 and 1 should be warped from the right and images 3 and 4 from the 
left. The matches were computed from the rightmost image in each pair; therefore, we 
reverse the order of the correspondences for the images warped from the left. We also 
take only the first output (the homography), as we are not interested in the inlier points 
for this warping case. 


Stitching the Images Together 

With the homographies between the images estimated (using RANSAC), we now need 
to warp all images to a common image plane. It makes most sense to use the plane 
of the center image (otherwise the distortions will be huge). One way to do this is to 
create a very large image, for example filled with zeros, parallel to the central image and 
to warp all the images to it. Since all our images are taken with a horizontal rotation of 
the camera, we can use a simpler procedure: we just pad the central image with zeros 
to the left or right to make room for the warped images. Add the following function, 
which handles this, to warp.py: 

def panorama(H,fromim,toim,padding=2400,delta=2400):
    """ Create horizontal panorama by blending two images
        using a homography H (preferably estimated using RANSAC).
        The result is an image with the same height as toim. 'padding'
        specifies number of fill pixels and 'delta' additional translation. """

    # check if images are grayscale or color
    is_color = len(fromim.shape) == 3

    # homography transformation for geometric_transform()
    def transf(p):
        p2 = dot(H,[p[0],p[1],1])
        return (p2[0]/p2[2],p2[1]/p2[2])

    if H[1,2]<0: # fromim is to the right
        print 'warp - right'
        # transform fromim
        if is_color:
            # pad the destination image with zeros to the right
            toim_t = hstack((toim,zeros((toim.shape[0],padding,3))))
            fromim_t = zeros((toim.shape[0],toim.shape[1]+padding,toim.shape[2]))
            for col in range(3):
                fromim_t[:,:,col] = ndimage.geometric_transform(fromim[:,:,col],
                                        transf,(toim.shape[0],toim.shape[1]+padding))
        else:
            # pad the destination image with zeros to the right
            toim_t = hstack((toim,zeros((toim.shape[0],padding))))
            fromim_t = ndimage.geometric_transform(fromim,transf,
                                    (toim.shape[0],toim.shape[1]+padding))
    else:
        print 'warp - left'
        # add translation to compensate for padding to the left
        H_delta = array([[1,0,0],[0,1,-delta],[0,0,1]])
        H = dot(H,H_delta)
        # transform fromim
        if is_color:
            # pad the destination image with zeros to the left
            toim_t = hstack((zeros((toim.shape[0],padding,3)),toim))
            fromim_t = zeros((toim.shape[0],toim.shape[1]+padding,toim.shape[2]))
            for col in range(3):
                fromim_t[:,:,col] = ndimage.geometric_transform(fromim[:,:,col],
                                        transf,(toim.shape[0],toim.shape[1]+padding))
        else:
            # pad the destination image with zeros to the left
            toim_t = hstack((zeros((toim.shape[0],padding)),toim))
            fromim_t = ndimage.geometric_transform(fromim,
                                    transf,(toim.shape[0],toim.shape[1]+padding))

    # blend and return (put fromim above toim)
    if is_color:
        # all non black pixels
        alpha = ((fromim_t[:,:,0] * fromim_t[:,:,1] * fromim_t[:,:,2] ) > 0)
        for col in range(3):
            toim_t[:,:,col] = fromim_t[:,:,col]*alpha + toim_t[:,:,col]*(1-alpha)
    else:
        alpha = (fromim_t > 0)
        toim_t = fromim_t*alpha + toim_t*(1-alpha)

    return toim_t

For a general geometric_transform(), a function describing the pixel-to-pixel map needs 
to be specified. In this case, transf() does this by multiplying with H and normalizing 
the homogeneous coordinates. By checking the translation value in H, we can decide 
if the image should be padded to the left or the right. When the image is padded to 
the left, the coordinates of the points in the target image change, so in the “left” case 
a translation is added to the homography. For simplicity, we also still use the trick of 
zero pixels for finding the alpha map. 
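
The effect of the extra translation can be checked in isolation. The sketch below uses a 
made-up homography (the numbers in H are purely illustrative) and a hypothetical helper 
apply_homography(). It confirms that a pixel shifted by delta columns through left padding 
maps to the same source location once H is composed with H_delta: 

```python
import numpy as np

def apply_homography(H, p):
    """Map a 2D point p with the 3x3 homography H and normalize."""
    q = np.dot(H, [p[0], p[1], 1.0])
    return q[:2] / q[2]

# a hypothetical homography, chosen only for illustration
H = np.array([[1.0, 0.02, 5.0],
              [0.01, 1.0, -300.0],
              [0.0, 1e-4, 1.0]])

delta = 2000
# subtracts delta from the second coordinate (the column, in array order)
H_delta = np.array([[1, 0, 0], [0, 1, -delta], [0, 0, 1]])

p = np.array([100.0, 250.0])     # a pixel in the unpadded target image
p_padded = p + [0, delta]        # the same pixel after padding to the left

q = apply_homography(H, p)
q_padded = apply_homography(np.dot(H, H_delta), p_padded)
```

Both q and q_padded refer to the same location in the source image, which is why 
panorama() replaces H by dot(H,H_delta) in the “left” branch. 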

Now use this function on the images as follows: 

# warp the images
delta = 2000 # for padding and translation

im1 = array(Image.open(imname[1]))
im2 = array(Image.open(imname[2]))
im_12 = warp.panorama(H_12,im1,im2,delta,delta)

im1 = array(Image.open(imname[0]))
im_02 = warp.panorama(dot(H_12,H_01),im1,im_12,delta,delta)

im1 = array(Image.open(imname[3]))
im_32 = warp.panorama(H_32,im1,im_02,delta,delta)

im1 = array(Image.open(imname[4]))
im_42 = warp.panorama(dot(H_32,H_43),im1,im_32,delta,2*delta)

Note that, in the last line, im_32 is already translated once. The resulting panorama 
image is shown in Figure 3-12. As you can see, there are effects of different exposure 
and edge effects at the boundaries between individual images. Commercial panorama 
software has extra processing to normalize intensity and smooth transitions to make 
the result look even better. 

Figure 3-12. Horizontal panorama automatically created from SIFT correspondences: the full 
panorama (top); a crop of the central part (bottom). 

Exercises 

1. Create a function that takes the image coordinates of a square (or rectangular) 
object (for example a book, a poster, or a 2D bar code) and estimates the transform 
that takes the rectangle to a full on frontal view in a normalized coordinate system. 
Use ginput() or the strongest Harris corners to find the points. 

2. Write a function that correctly determines the alpha map for a warp like the one 
in Figure 3-1. 

3. Find a data set of your own that contains three common landmark points (like 
those in the face example or using a famous object like the Eiffel tower). Create 
aligned images where the landmarks are in the same position. Compute mean and 
median images and visualize them. 

4. Implement intensity normalization and a better way to blend the images in the 
panorama example to remove the edge effects in Figure 3-12. 

5. Instead of warping to a central image, panoramas can be created by warping on to 
a cylinder. Try this for the example in Figure 3-12. 

6. Use RANSAC to find several dominant homography inlier sets. An easy way to 
do this is to first make one run of RANSAC, find the homography with the 
largest consistent subset, then remove the inliers from the set of matches, then 
run RANSAC again to get the next biggest set, and so on. 
7. Modify the homography RANSAC estimation to instead estimate affine transformations 
using three point correspondences. Use this to determine if a pair of images 
contains a planar scene, for example using the inlier count. A planar scene will have 
a high inlier count for an affine transformation. 

8. Build a panograph (http://en.wikipedia.org/wiki/Panography) from a collection (for 
example from Flickr) by matching local features and using least-squares rigid 
registration. 
CHAPTER 4 


Camera Models and 
Augmented Reality 


In this chapter, we will look at modeling cameras and how to effectively use such 
models. In the previous chapter, we covered image to image mappings and transforms. 
To handle mappings between 3D and images, the projection properties of the camera 
generating the image need to be part of the mapping. Here we show how to determine 
camera properties and how to use image projections for applications like augmented 
reality. In the next chapter, we will use the camera model to look at applications with 
multiple views and mappings between them. 

4.1 The Pin-Hole Camera Model 

The pin-hole camera model (or sometimes projective camera model) is a widely used 
camera model in computer vision. It is simple and accurate enough for most applications. 
The name comes from the type of camera, like a camera obscura, that collects 
light through a small hole to the inside of a dark box or room. In the pin-hole camera 
model, light passes through a single point, the camera center, C, before it is projected 
onto an image plane. Figure 4-1 shows an illustration where the image plane is drawn 
in front of the camera center. The image plane in an actual camera would be upside 
down behind the camera center, but the model is the same. 

The projection properties of a pin-hole camera can be derived from this illustration and 
the assumption that the image axes are aligned with the x and y axes of a 3D coordinate 
system. The optical axis of the camera then coincides with the z axis and the projection 
follows from similar triangles. By adding rotation and translation to put a 3D point 
in this coordinate system before projecting, the complete projection transform follows. 
The interested reader can find the details in [13] and [25, 26]. 

With a pin-hole camera, a 3D point X is projected to an image point x (both expressed 
in homogeneous coordinates) as 

$$ \lambda \mathbf{x} = P \mathbf{X}. \tag{4.1} $$

Here, the 3 × 4 matrix P is called the camera matrix (or projection matrix). Note that 
the 3D point X has four elements in homogeneous coordinates, $\mathbf{X} = [X, Y, Z, W]$. The 
scalar $\lambda$ is the inverse depth of the 3D point and is needed if we want all coordinates to 
be homogeneous with the last value normalized to one. 

Figure 4-1. The pin-hole camera model. The image point x is at the intersection of the image plane 
and the line joining the 3D point X and the camera center C. The dashed line is the optical axis of the 
camera. 

The Camera Matrix 

The camera matrix can be decomposed as 

$$ P = K [R \mid \mathbf{t}], \tag{4.2} $$

where R is a rotation matrix describing the orientation of the camera, t a 3D translation 
vector describing the position of the camera center, and the intrinsic calibration matrix 
K describing the projection properties of the camera. 

The calibration matrix depends only on the camera properties and is in a general form 
written as 

$$ K = \begin{bmatrix} \alpha f & s & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{bmatrix}. $$

The focal length, f, is the distance between the image plane and the camera center. The 
skew, s, is only used if the pixel array in the sensor is skewed and can in most cases 
safely be set to zero. This gives 

$$ K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}, \tag{4.3} $$

where we used the alternative notation $f_x$ and $f_y$, with $f_x = \alpha f_y$. 

The aspect ratio, $\alpha$, is used for non-square pixel elements. It is often safe to assume 
$\alpha = 1$. With this assumption, the matrix becomes 

$$ K = \begin{bmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{bmatrix}. $$

Besides the focal length, the only remaining parameters are the coordinates of the optical 
center (sometimes called the principal point), the image point $\mathbf{c} = [c_x, c_y]$ where the 
optical axis intersects the image plane. Since this is usually in the center of the image 
and image coordinates are measured from the top-left corner, these values are often 
well approximated with half the width and height of the image. It is worth noting that 
in this last case the only unknown variable is the focal length f. 
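
As a small illustration of this simplified form, the sketch below builds K from a focal 
length and an image size. The helper name calibration_matrix() and the numbers are 
assumptions for illustration, not part of the book's modules: 

```python
import numpy as np

def calibration_matrix(f, width, height):
    """K for square pixels, zero skew, and the optical center assumed
    to be at the middle of a width x height image."""
    return np.array([[f, 0.0, 0.5 * width],
                     [0.0, f, 0.5 * height],
                     [0.0, 0.0, 1.0]])

K = calibration_matrix(1000.0, 640, 480)   # illustrative numbers only

# a point on the optical axis, in camera coordinates, lands on the optical center
x = np.dot(K, [0.0, 0.0, 5.0])
x = x / x[2]
```

After normalization, x holds the optical center, here half the width and half the height 
of the image. 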

Projecting 3D Points 

Let's create a camera class to handle all the operations we need for modeling cameras 
and projections: 

from numpy import *
from scipy import linalg

class Camera(object):
    """ Class for representing pin-hole cameras. """

    def __init__(self,P):
        """ Initialize P = K[R|t] camera model. """
        self.P = P
        self.K = None # calibration matrix
        self.R = None # rotation
        self.t = None # translation
        self.c = None # camera center

    def project(self,X):
        """ Project points in X (4*n array) and normalize coordinates. """

        x = dot(self.P,X)
        for i in range(3):
            x[i] /= x[2]
        return x

The example below shows how to project 3D points into an image view. In this example, 
we will use one of the Oxford multi-view datasets, the “Model House” data set, available 
at http://www.robots.ox.ac.uk/~vgg/data/data-mview.html. Download the 3D geometry 
file and copy the house.p3d file to your working directory: 

import camera

# load points
points = loadtxt('house.p3d').T
points = vstack((points,ones(points.shape[1])))

# setup camera
P = hstack((eye(3),array([[0],[0],[-10]])))
cam = camera.Camera(P)
x = cam.project(points)

# plot projection
figure()
plot(x[0],x[1],'k.')
show()

First, we make the points into homogeneous coordinates and create a Camera object 
with a projection matrix before projecting the 3D points and plotting them. The result 
looks like the middle plot in Figure 4-2. 

To see how moving the camera changes the projection, try the following piece of code 
that incrementally rotates the camera around a random 3D axis: 

# create transformation
r = 0.05*random.rand(3)
rot = camera.rotation_matrix(r)

# rotate camera and project
figure()
for t in range(20):
    cam.P = dot(cam.P,rot)
    x = cam.project(points)
    plot(x[0],x[1],'k.')
show()

Here we used the helper function rotation_matrix(), which creates a rotation matrix 
for 3D rotations around a vector (add this to camera.py): 

def rotation_matrix(a):
    """ Creates a 3D rotation matrix for rotation
        around the axis of the vector a. """
    R = eye(4)
    R[:3,:3] = linalg.expm([[0,-a[2],a[1]],[a[2],0,-a[0]],[-a[1],a[0],0]])
    return R

Figure 4-2 shows one of the images from the sequence, a projection of the 3D points 
and the projected 3D point tracks after the points have been rotated around a random 
vector. Try this example a few times with different random rotations and you will get a 
feel for how the points rotate from the projections. 
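
The matrix exponential of a skew-symmetric matrix has a closed form, known as Rodrigues' 
formula, in which the length of the vector a acts as the rotation angle. The sketch 
below is an alternative that needs only NumPy; the function name is an assumption and it 
returns the 3 × 3 block only, not the 4 × 4 matrix used above: 

```python
import numpy as np

def rotation_from_axis_angle(a):
    """3x3 rotation about the axis a by the angle |a| (Rodrigues' formula)."""
    a = np.asarray(a, dtype=float)
    theta = np.linalg.norm(a)
    if theta < 1e-12:
        return np.eye(3)  # no rotation
    k = a / theta
    # skew-symmetric cross product matrix of the unit axis
    S = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * S + (1 - np.cos(theta)) * np.dot(S, S)

R = rotation_from_axis_angle([0, 0, np.pi / 2])   # quarter turn about the z axis
```

For this input, R maps the x axis onto the y axis, and in general the result is orthogonal 
with determinant one, as a rotation matrix should be. 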



Figure 4-2. An example of projecting 3D points: sample image (left); projected points into a view 
(middle); trajectory of projected points under camera rotation (right). Data from the Oxford “Model 
House” dataset. 

Factoring the Camera Matrix 

If we are given a camera matrix P of the form in equation (4.2), we need to be able 
to recover the internal parameters K and the camera position and pose t and R. 
Partitioning the matrix is called factorization. In this case, we will use a type of matrix 
factorization called RQ-factorization. 

Add the following method to the Camera class: 

    def factor(self):
        """ Factorize the camera matrix into K,R,t as P = K[R|t]. """

        # factor first 3*3 part
        K,R = linalg.rq(self.P[:,:3])

        # make diagonal of K positive
        T = diag(sign(diag(K)))
        if linalg.det(T) < 0:
            T[1,1] *= -1

        self.K = dot(K,T)
        self.R = dot(T,R) # T is its own inverse
        self.t = dot(linalg.inv(self.K),self.P[:,3])

        return self.K, self.R, self.t

RQ-factorization is not unique; there is a sign ambiguity in the factorization. Since we 
need the rotation matrix R to have positive determinant (otherwise the coordinate axes 
can get flipped), we can add a transform T to change the sign when needed. 
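
The ambiguity is easy to see numerically. In the sketch below (all values are made up for 
illustration), a diagonal matrix T with entries ±1 is inserted between K and R. Both 
factor pairs give the same product, but only one of them has a K with positive diagonal: 

```python
import numpy as np

K = np.array([[1000.0, 0.0, 500.0],
              [0.0, 1000.0, 300.0],
              [0.0, 0.0, 1.0]])
c, s = np.cos(0.3), np.sin(0.3)
R = np.array([[c, -s, 0.0],
              [s, c, 0.0],
              [0.0, 0.0, 1.0]])

# flip the sign of two axes; T is diagonal with entries +-1, so dot(T,T) = I
T = np.diag([-1.0, -1.0, 1.0])
K_alt = np.dot(K, T)   # still upper triangular, but with negative diagonal entries
R_alt = np.dot(T, R)   # still a rotation because det(T) = +1

same_product = np.allclose(np.dot(K_alt, R_alt), np.dot(K, R))
```

Since T is its own inverse, multiplying K from the right and R from the left by T leaves 
the product KR unchanged, which is exactly what factor() exploits. 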

Try this on a sample camera to see that it works: 

import camera

K = array([[1000,0,500],[0,1000,300],[0,0,1]])
tmp = camera.rotation_matrix([0,0,1])[:3,:3]
Rt = hstack((tmp,array([[50],[40],[30]])))
cam = camera.Camera(dot(K,Rt))

print K,Rt
print cam.factor()

You should get the same printout in the console. 

Computing the Camera Center 

Given a camera projection matrix, P, it is useful to be able to compute the camera's 
position in space. The camera center, C, is a 3D point with the property PC = 0. For 
a camera with P = K[R | t], this gives 

$$ K[R \mid \mathbf{t}]\mathbf{C} = KR\mathbf{C} + K\mathbf{t} = 0, $$

and the camera center can be computed as 

$$ \mathbf{C} = -R^T \mathbf{t}. $$

Note that the camera center is independent of the intrinsic calibration K, as expected. 
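
A quick numerical check of this relation, using made-up values for K, R, and t (these 
numbers are assumptions for illustration only): 

```python
import numpy as np

K = np.array([[1000.0, 0.0, 500.0],
              [0.0, 1000.0, 300.0],
              [0.0, 0.0, 1.0]])
c, s = np.cos(0.5), np.sin(0.5)
R = np.array([[c, 0.0, s],
              [0.0, 1.0, 0.0],
              [-s, 0.0, c]])          # rotation about the y axis
t = np.array([50.0, 40.0, 30.0])

P = np.dot(K, np.hstack((R, t.reshape(3, 1))))

C = -np.dot(R.T, t)                    # camera center from R and t only
residual = np.dot(P, np.append(C, 1))  # P applied to the homogeneous center
```

The residual is zero up to rounding, and K never enters the computation of C. 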
Add the following method for computing the camera center according to the formula 
above and/or returning the camera center to the Camera class: 

    def center(self):
        """ Compute and return the camera center. """

        if self.c is not None:
            return self.c
        else:
            # compute c by factoring
            self.factor()
            self.c = -dot(self.R.T,self.t)
            return self.c

This concludes the basic functions of our Camera class. Now, let's see how to work with 
this pin-hole camera model. 


4.2 Camera Calibration 

Calibrating a camera means determining the internal camera parameters, in our case 
the matrix K. It is possible to extend this camera model to include radial distortion and 
other artifacts if your application needs precise measurements. For most applications, 
however, the simple model in equation (4.3) is good enough. The standard way to 
calibrate cameras is to take lots of pictures of a flat checkerboard pattern. For example, 
the calibration tools in OpenCV use this approach (see [3] for details). 


A Simple Calibration Method 

Here we will look at a simple calibration method. Since most of the parameters can be 
set using basic assumptions (square straight pixels, optical center at the center of the 
image), the tricky part is getting the focal length right. For this calibration method, you 
need a flat rectangular calibration object (a book will do), measuring tape or a ruler, 
and a flat surface. Here's what to do: 


• Measure the sides of your rectangular calibration object. Let's call these dX and 
dY. 

• Place the camera and the calibration object on a flat surface so that the camera 
back and calibration object are parallel and the object is roughly in the center of the 
camera's view. You might have to raise the camera or object to get a nice alignment. 

• Measure the distance from the camera to the calibration object. Let's call this dZ. 

• Take a picture and check that the setup is straight, meaning that the sides of the 
calibration object align with the rows and columns of the image. 

• Measure the width and height of the object in pixels. Let's call these dx and dy. 


See Figure 4-3 for an example of a setup. Now, using similar triangles (look at Figure 4-1 
to convince yourself of that), the following relation gives the focal lengths: 

$$ f_x = \frac{dx}{dX}\, dZ, \qquad f_y = \frac{dy}{dY}\, dZ. $$
Figure 4-3. A simple camera calibration setup: an image of the setup used (left); the image used for 
the calibration (right). Measuring the width and height of the calibration object in the image and the 
physical dimensions of the setup is enough to determine the focal length. 


For the particular setup in Figure 4-3, the object was measured to be 130 by 185 mm, 
so dX = 130 and dY = 185. The distance from camera to object was 460 mm, so 
dZ = 460. You can use any unit of measurement; only the ratios of the measurements 
matter. Using ginput() to select four points in the image, the width and height in pixels 
were 722 and 1040. This means that dx = 722 and dy = 1040. Putting these values in 
the relationship above gives 

$$ f_x = 2555, \qquad f_y = 2586. $$
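
The arithmetic can be reproduced in a couple of lines. The helper name focal_lengths() is 
an assumption; the measurements are the ones quoted above: 

```python
def focal_lengths(dx, dy, dX, dY, dZ):
    """Focal lengths in pixels from the similar-triangle relations
    fx = dx/dX * dZ and fy = dy/dY * dZ."""
    fx = float(dx) / dX * dZ
    fy = float(dy) / dY * dZ
    return fx, fy

# object size and distance in mm, image measurements in pixels
fx, fy = focal_lengths(dx=722, dy=1040, dX=130, dY=185, dZ=460)
```

Rounding fx and fy to whole pixels gives the two values stated above. 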

Now, it is important to note that this is for a particular image resolution. In this case, the 
image was 2592 x 1936 pixels. Remember that the focal length and the optical center 
are measured in pixels and scale with the image resolution. If you take other image 
resolutions (for example a thumbnail image), the values will change. It is convenient to 
add the constants of your camera to a helper function like this: 


def my_calibration(sz):
    row,col = sz
    # float constants avoid integer truncation of the focal lengths and of K
    fx = 2555.0*col/2592
    fy = 2586.0*row/1936
    K = diag([fx,fy,1])
    K[0,2] = 0.5*col
    K[1,2] = 0.5*row
    return K


This function then takes a size tuple and returns the calibration matrix. Here we assume 
the optical center to be the center of the image. Go ahead and replace the focal lengths 
with their mean if you like; for most consumer type cameras this is fine. Note that the 
calibration is for images in landscape orientation. For portrait orientation, you need 
to interchange the constants. Let's keep this function and make use of it in the next 
section. 
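
The scaling with resolution can also be written as a matrix product. As a sketch, assuming 
the image is resized uniformly by a factor scale and ignoring sub-pixel offsets, the new 
calibration matrix is diag(scale, scale, 1) times the old one. The function name is an 
assumption: 

```python
import numpy as np

def scale_calibration(K, scale):
    """Calibration matrix for an image resized uniformly by `scale`."""
    S = np.diag([scale, scale, 1.0])
    return np.dot(S, K)

K_full = np.array([[2555.0, 0.0, 1296.0],
                   [0.0, 2586.0, 968.0],
                   [0.0, 0.0, 1.0]])       # full resolution, 2592 x 1936
K_half = scale_calibration(K_full, 0.5)   # same camera, image at 1296 x 968
```

Both the focal lengths and the optical center are halved, while the last row stays 
[0, 0, 1]. 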
4.3 Pose Estimation from Planes and Markers 

In Chapter 3, we saw how to estimate homographies between planes. Combining this 
with a calibrated camera makes it possible to compute the camera's pose (rotation and 
translation) if the image contains a planar marker object. This marker object can be 
almost any flat object. 

Let's illustrate with an example. Consider the two top images in Figure 4-4. The following 
code will extract SIFT features in both images and robustly estimate a homography 
using RANSAC: 

import homography
import camera
import sift

# compute features
sift.process_image('book_frontal.JPG','im0.sift')
l0,d0 = sift.read_features_from_file('im0.sift')

sift.process_image('book_perspective.JPG','im1.sift')
l1,d1 = sift.read_features_from_file('im1.sift')

# match features and estimate homography
matches = sift.match_twosided(d0,d1)
ndx = matches.nonzero()[0]
fp = homography.make_homog(l0[ndx,:2].T)
ndx2 = [int(matches[i]) for i in ndx]
tp = homography.make_homog(l1[ndx2,:2].T)

model = homography.RansacModel()
# H_from_ransac() returns the homography together with the inliers
H,inliers = homography.H_from_ransac(fp,tp,model)


Figure 4-4. Example of computing the projection matrix for a new view using a planar object as marker. 
Matching image features to an aligned marker gives a homography that can be used to compute the pose 
of the camera. Template image with a gray square (top left); an image taken from an unknown viewpoint 
with the same square transformed with the estimated homography (top right); a cube transformed using 
the estimated camera matrix (bottom). 


Now we have a homography that maps points on the marker (in this case the book) 
in one image to their corresponding locations in the other image. Let's define our 3D 
coordinate system so that the marker lies in the X-Y plane (Z = 0) with the origin 
somewhere on the marker. 

To check our results, we will need some simple 3D object placed on the marker. Here 
we will use a cube and generate the cube points using the function: 


def cube_points(c,wid):
    """ Creates a list of points for plotting
        a cube with plot. (the first 5 points are
        the bottom square, some sides repeated). """
    p = []
    # bottom
    p.append([c[0]-wid,c[1]-wid,c[2]-wid])
    p.append([c[0]-wid,c[1]+wid,c[2]-wid])
    p.append([c[0]+wid,c[1]+wid,c[2]-wid])
    p.append([c[0]+wid,c[1]-wid,c[2]-wid])
    p.append([c[0]-wid,c[1]-wid,c[2]-wid]) # same as first to close plot

    # top
    p.append([c[0]-wid,c[1]-wid,c[2]+wid])
    p.append([c[0]-wid,c[1]+wid,c[2]+wid])
    p.append([c[0]+wid,c[1]+wid,c[2]+wid])
    p.append([c[0]+wid,c[1]-wid,c[2]+wid])
    p.append([c[0]-wid,c[1]-wid,c[2]+wid]) # same as first to close plot

    # vertical sides
    p.append([c[0]-wid,c[1]-wid,c[2]+wid])
    p.append([c[0]-wid,c[1]+wid,c[2]+wid])
    p.append([c[0]-wid,c[1]+wid,c[2]-wid])
    p.append([c[0]+wid,c[1]+wid,c[2]-wid])
    p.append([c[0]+wid,c[1]+wid,c[2]+wid])
    p.append([c[0]+wid,c[1]-wid,c[2]+wid])
    p.append([c[0]+wid,c[1]-wid,c[2]-wid])

    return array(p).T


Some points are reoccurring so that plot() will generate a nice-looking cube. 

With a homography and a camera calibration matrix, we can now determine the relative 
transformation between the two views: 

# camera calibration
K = my_calibration((747,1000))

# 3D points at plane z=0 with sides of length 0.2
box = cube_points([0,0,0.1],0.1)

# project bottom square in first image
cam1 = camera.Camera( hstack((K,dot(K,array([[0],[0],[-1]])) )) )
# first points are the bottom square
box_cam1 = cam1.project(homography.make_homog(box[:,:5]))

# use H to transfer points to the second image
box_trans = homography.normalize(dot(H,box_cam1))

# compute second camera matrix from cam1 and H
cam2 = camera.Camera(dot(H,cam1.P))
A = dot(linalg.inv(K),cam2.P[:,:3])
A = array([A[:,0],A[:,1],cross(A[:,0],A[:,1])]).T
cam2.P[:,:3] = dot(K,A)

# project with the second camera
box_cam2 = cam2.project(homography.make_homog(box))

# test: projecting point on z=0 should give the same
point = array([1,1,0,1]).T
print homography.normalize(dot(dot(H,cam1.P),point))
print cam2.project(point)

Here we use a version of the image with resolution 747 × 1000 and first generate the 
calibration matrix for that image size. Next, points for a cube at the origin are created. 
The first five points generated by cube_points() correspond to the bottom, which in 
this case will lie on the plane defined by Z = 0, the plane of the marker. The first image 
(top left in Figure 4-4) is roughly a straight frontal view of the book and will be used 
as our template image. Since the scale of the scene coordinates is arbitrary, we create a 
first camera with matrix 

$$ P_1 = K \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & -1 \end{bmatrix}, $$

which has coordinate axes aligned with the camera and placed above the marker. The 
first five 3D points are projected onto the image. With the estimated homography, we 
can transform these to the second image. Plotting them should show the corners at the 
same marker locations (see top right in Figure 4-4). 

Now, composing $P_1$ with H as a camera matrix for the second image, 

$$ P_2 = H P_1, $$

will transform points on the marker plane Z = 0 correctly. This means that the first two 
columns and the fourth column of $P_2$ are correct. Since we know that the first 3 × 3 
block should be KR and R is a rotation matrix, we can recover the third column by 
multiplying $P_2$ with the inverse of the calibration matrix and replacing the third column 
with the cross product of the first two. 
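
The cross product step relies on a property of rotation matrices: the third column equals 
the cross product of the first two. The sketch below checks this for an exact rotation 
built from two made-up angles. With an estimated homography, which is noisy and only 
defined up to scale, the first two columns are only approximately orthonormal, so the 
recovered block is an approximation: 

```python
import numpy as np

# a rotation composed of two elementary rotations (illustrative angles)
a, b = 0.4, -0.7
Rz = np.array([[np.cos(a), -np.sin(a), 0.0],
               [np.sin(a), np.cos(a), 0.0],
               [0.0, 0.0, 1.0]])
Rx = np.array([[1.0, 0.0, 0.0],
               [0.0, np.cos(b), -np.sin(b)],
               [0.0, np.sin(b), np.cos(b)]])
R = np.dot(Rz, Rx)

# pretend the third column is unknown and rebuild it from the first two
r3 = np.cross(R[:, 0], R[:, 1])
```

Here r3 matches the third column of R to machine precision. 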

As a sanity check, we can project a point on the marker plane with the new matrix 
and check that it gives the same projection as the same point transformed with the first 
camera and the homography. You should get the same printout in your console. 

Visualizing the projected points can be done like this: 

im0 = array(Image.open('book_frontal.JPG'))
im1 = array(Image.open('book_perspective.JPG'))

# 2D projection of bottom square
figure()
imshow(im0)
plot(box_cam1[0,:],box_cam1[1,:],linewidth=3)

# 2D projection transferred with H
figure()
imshow(im1)
plot(box_trans[0,:],box_trans[1,:],linewidth=3)

# 3D cube
figure()
imshow(im1)
plot(box_cam2[0,:],box_cam2[1,:],linewidth=3)

show()

This should give three figures like the images in Figure 4-4. To be able to reuse these 
computations for future examples, we can save the camera matrices using pickle: 

import pickle

with open('ar_camera.pkl','w') as f:
    pickle.dump(K,f)
    pickle.dump(dot(linalg.inv(K),cam2.P),f)
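
The order of the dump() calls matters when reading the file back, since load() returns the 
objects in the order they were written. A small sketch, with an in-memory buffer standing 
in for ar_camera.pkl and made-up list values in place of the NumPy arrays: 

```python
import io
import pickle

# stand-in values; the real script stores NumPy arrays
K = [[1000.0, 0.0, 500.0], [0.0, 1000.0, 300.0], [0.0, 0.0, 1.0]]
Rt = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]

buf = io.BytesIO()            # in-memory stand-in for the file
pickle.dump(K, buf)
pickle.dump(Rt, buf)

buf.seek(0)
K_loaded = pickle.load(buf)   # first object written
Rt_loaded = pickle.load(buf)  # second object written
```

A script that reuses the saved camera therefore needs two load() calls, first for K and 
then for the second matrix. 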

Now we have seen how to compute the camera matrix given a planar scene object. We 
combined feature matching with homographies and camera calibration to produce a 
simple example of placing a cube in an image. With camera pose estimation, we now 
have the building blocks in place for creating simple augmented reality applications. 

4.4 Augmented Reality 

Augmented reality (AR) is a collective term for placing objects and information on top 
of image data. The classic example is placing a 3D computer graphics model so that 
it looks like it belongs in the scene, and moves naturally with the camera motion in 
the case of video. Given an image with a marker plane as in the section above, we can 
compute the camera's position and pose and use that to place computer graphics models 
so that they are rendered correctly. In this last section of our camera chapter we will 
show how to build a simple AR example. We will use two tools for this, PyGame and 
PyOpenGL. 





PyGame and PyOpenGL 

PyGame is a popular package for game development that easily handles display windows, 
input devices, events, and much more. PyGame is open source and available 
from http://www.pygame.org/. It is actually a Python binding for the SDL game engine. 
For installation instructions, see Appendix A. For more details on programming with 
PyGame, see, for example, [21]. 

PyOpenGL is the Python binding to the OpenGL graphics programming interface. 
OpenGL comes pre-installed on almost all systems and is a crucial part for graphics 
performance. OpenGL is cross platform and works the same across operating systems. 
Take a look at http://www.opengl.org/ for more information on OpenGL. The getting 
started page (http://www.opengl.org/wiki/Getting_started) has resources for beginners. 
PyOpenGL is open source and easy to install; see Appendix A for details. More information 
can be found on the project website, http://pyopengl.sourceforge.net/. 

There is no way we can cover any significant portion of OpenGL programming here. We 
will instead just show the important parts, for example how to use camera matrices in 
OpenGL and how to set up a basic 3D model. Some good examples and demos are available 
in the PyOpenGL-Demo package (http://pypi.python.org/pypi/PyOpenGL-Demo). This 
is a good place to start if you are new to PyOpenGL. 

We want to place a 3D model in a scene using OpenGL. To use PyGame and PyOpenGL 
for this application, we need to import the following at the top of our scripts: 

from OpenGL.GL import * 
from OpenGL.GLU import * 
import pygame, pygame.image 
from pygame.locals import * 

As you can see, we need two main parts from OpenGL. The GL part contains all 
functions starting with “gl”, which, you will see, are most of the ones we need. The GLU 
part is the OpenGL Utility library and contains some higher-level functionality. We 
will mainly use it to set up the camera projection. The pygame part sets up the window 
and event controls, and pygame.image is used for loading images and creating OpenGL 
textures. The pygame.locals part is needed for setting up the display area for OpenGL. 

The two main components of setting up an OpenGL scene are the projection and model 
view matrices. Let's get started and see how to create these matrices from our pin-hole 
cameras. 

From Camera Matrix to OpenGL Format 

OpenGL uses 4 x 4 matrices to represent transforms (both 3D transforms and projections). 
This is only slightly different from our use of 3 x 4 camera matrices. However, 
the camera-scene transformations are separated in two matrices, the GL_PROJECTION 
matrix and the GL_MODELVIEW matrix. GL_PROJECTION handles the image formation 
properties and is the equivalent of our internal calibration matrix K. 
GL_MODELVIEW handles the 3D transformation of the relation between the objects and 
the camera. This corresponds roughly to the R and t part of our camera matrix. One 
difference is that the coordinate system is assumed to be centered at the camera so the 
GL_MODELVIEW matrix actually contains the transformation that places the objects 
in front of the camera. There are many peculiarities with working in OpenGL; we will 
comment on them as they are encountered in the examples below. 

Given that we have a camera calibrated so that the calibration matrix K is known, the 
following function translates the camera properties to an OpenGL projection matrix: 

def set_projection_from_camera(K):
    """ Set view from a camera calibration matrix. """

    glMatrixMode(GL_PROJECTION)
    glLoadIdentity()

    # width and height are the global image dimensions set in the main script
    fx = K[0,0]
    fy = K[1,1]
    fovy = 2*arctan(0.5*height/fy)*180/pi
    aspect = (width*fy)/(height*fx)

    # define the near and far clipping planes
    near = 0.1
    far = 100.0

    # set perspective
    gluPerspective(fovy,aspect,near,far)
    glViewport(0,0,width,height)

We assume the calibration to be of the simpler form in (4.3) with the optical center 
at the image center. The first function glMatrixMode() sets the working matrix 
to GL_PROJECTION and subsequent commands will modify this matrix.^ Then 
glLoadIdentity() sets the matrix to the identity matrix, basically resetting any prior 
changes. We then calculate the vertical field of view in degrees with the help of the 
image height and the camera's focal length as well as the aspect ratio. An OpenGL 
projection also has a near and far clipping plane to limit the depth range of what is rendered. 
We just set the near depth to be small enough to contain the nearest object and 
the far depth to some large number. We use the GLU utility function gluPerspective() 
to set the projection matrix and define the whole image to be the view port (essentially 
what is to be shown). There is also an option to load a full projection matrix with 
glLoadMatrixf() similar to the model view function below. This is useful when the 
simple version of the calibration matrix is not good enough. 
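
The arithmetic in this function can be checked without opening an OpenGL window. The following is a minimal sketch, assuming NumPy only; the helper names fov_and_aspect() and perspective_matrix() and the sample calibration matrix are made up for this illustration. The second helper builds the 4 x 4 matrix given in the gluPerspective() documentation, so that you can inspect what ends up in GL_PROJECTION.

```python
import numpy as np

def fov_and_aspect(K, width, height):
    """Vertical field of view (degrees) and aspect ratio from K.
    Assumes the optical center is at the image center."""
    fx, fy = K[0, 0], K[1, 1]
    fovy = 2 * np.arctan(0.5 * height / fy) * 180 / np.pi
    aspect = (width * fy) / (height * fx)
    return fovy, aspect

def perspective_matrix(fovy, aspect, near, far):
    """The 4x4 matrix described in the gluPerspective() documentation."""
    f = 1.0 / np.tan(np.radians(fovy) / 2)
    return np.array([
        [f / aspect, 0, 0, 0],
        [0, f, 0, 0],
        [0, 0, (far + near) / (near - far), 2 * far * near / (near - far)],
        [0, 0, -1, 0]])

# hypothetical calibration: focal length 500 pixels, 1000 x 1000 image
K = np.diag([500.0, 500.0, 1.0])
fovy, aspect = fov_and_aspect(K, 1000, 1000)
M = perspective_matrix(fovy, aspect, 0.1, 100.0)
```

With a focal length equal to half the image height, the vertical field of view comes out as 90 degrees, which is a quick sanity check on the formula.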

The model view matrix should encode the relative rotation and translation that brings 
the object in front of the camera (as if the camera was at the origin). It is a 4 x 4 matrix 
that typically looks like this: 

\[ \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}, \]

where R is a rotation matrix with columns equal to the direction of the three coordinate 
axes and t is a translation vector. When creating a model view matrix, the rotation part 
will need to hold all rotations (object and coordinate system) by multiplying together 
the individual components. 

^ This is an odd way to handle things, but there are only two matrices to switch between, GL_PROJECTION 
and GL_MODELVIEW, so it is manageable. 

The following function shows how to take a 3 x 4 pin-hole camera matrix with the 
calibration removed (multiply P with K^{-1}) and create a model view: 

def set_modelview_from_camera(Rt):
    """ Set the model view matrix from camera pose. """

    glMatrixMode(GL_MODELVIEW)
    glLoadIdentity()

    # rotate teapot 90 deg around x-axis so that z-axis is up
    Rx = array([[1,0,0],[0,0,-1],[0,1,0]])

    # set rotation to best approximation
    R = Rt[:,:3]
    U,S,V = linalg.svd(R)
    R = dot(U,V)
    R[0,:] = -R[0,:] # change sign of x-axis

    # set translation
    t = Rt[:,3]

    # setup 4*4 model view matrix
    M = eye(4)
    M[:3,:3] = dot(R,Rx)
    M[:3,3] = t

    # transpose and flatten to get column order
    M = M.T
    m = M.flatten()

    # replace model view with the new matrix
    glLoadMatrixf(m)

First, we switch to work on the GL_MODELVIEW matrix and reset it. Then we create 
a 90-degree rotation matrix, since the object we want to place needs to be rotated 
(you will see below). Then we make sure that the rotation part of the camera matrix 
is indeed a rotation matrix, in case there are errors or noise when we estimated the 
camera matrix. This is done with SVD and the best rotation matrix approximation is 
given by R = UV^T. The OpenGL coordinate system is a little different, so we flip the 
x-axis around. Then we set the model view matrix M by multiplying the rotations. The 
function glLoadMatrixf () sets the model view matrix and takes an array of the 16 values 
of the matrix taken column-wise. Transposing and then flattening accomplishes this. 
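
The two numerical steps here, the SVD-based rotation approximation and the column-wise flattening, can be tried in isolation. This is a small sketch under assumed inputs (a made-up noisy rotation and translation); the helper names closest_rotation() and to_opengl_order() are hypothetical and not part of the code above.

```python
import numpy as np

def closest_rotation(R):
    """Orthogonal matrix closest to R in the least squares sense, via SVD."""
    U, S, V = np.linalg.svd(R)   # NumPy returns V already transposed
    return np.dot(U, V)

def to_opengl_order(M):
    """Flatten a 4x4 matrix column by column, the order glLoadMatrixf() expects."""
    return M.T.flatten()

# a rotation about the z-axis, perturbed by a little noise (assumed test data)
a = 0.3
R_true = np.array([[np.cos(a), -np.sin(a), 0],
                   [np.sin(a),  np.cos(a), 0],
                   [0, 0, 1]])
rng = np.random.RandomState(0)
R_noisy = R_true + 0.01 * rng.randn(3, 3)
R = closest_rotation(R_noisy)

M = np.eye(4)
M[:3, :3] = R
M[:3, 3] = [1, 2, 3]
m = to_opengl_order(M)
```

Note that the product UV^T is only guaranteed to be orthogonal. If the estimated matrix is far from a rotation, the result can have determinant -1 (a reflection), so checking the sign of the determinant is a reasonable extra safeguard.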


Placing Virtual Objects in the Image 

The first thing we need to do is to add the image (the one we want to place virtual 
objects in) as a background. In OpenGL this is done by creating a quadrilateral, a 
quad, that fills the whole view. The easiest way to do this is to draw the quad with the 
projection and model view matrices reset so that the coordinates go from -1 to 1 in 
each dimension. 

This function loads an image, converts it to an OpenGL texture, and places that texture 
on the quad: 

def draw_background(imname):
    """ Draw background image using a quad. """

    # load background image (should be .bmp) to OpenGL texture
    bg_image = pygame.image.load(imname).convert()
    bg_data = pygame.image.tostring(bg_image,"RGBX",1)

    glMatrixMode(GL_MODELVIEW)
    glLoadIdentity()
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT)

    # bind the texture
    glEnable(GL_TEXTURE_2D)
    glBindTexture(GL_TEXTURE_2D,glGenTextures(1))
    glTexImage2D(GL_TEXTURE_2D,0,GL_RGBA,width,height,0,GL_RGBA,GL_UNSIGNED_BYTE,bg_data)
    glTexParameterf(GL_TEXTURE_2D,GL_TEXTURE_MAG_FILTER,GL_NEAREST)
    glTexParameterf(GL_TEXTURE_2D,GL_TEXTURE_MIN_FILTER,GL_NEAREST)

    # create quad to fill the whole window
    glBegin(GL_QUADS)
    glTexCoord2f(0.0,0.0); glVertex3f(-1.0,-1.0,-1.0)
    glTexCoord2f(1.0,0.0); glVertex3f( 1.0,-1.0,-1.0)
    glTexCoord2f(1.0,1.0); glVertex3f( 1.0, 1.0,-1.0)
    glTexCoord2f(0.0,1.0); glVertex3f(-1.0, 1.0,-1.0)
    glEnd()

    # clear the texture
    glDeleteTextures(1)

This function first uses some PyGame functions to load an image and serialize it to 
a raw string representation that can be used by PyOpenGL. Then we reset the model 
view and clear the color and depth buffer. Next, we bind the texture so that we can 
use it for the quad and specify interpolation. The quad is defined with corners at -1 
and 1 in both dimensions. Note that the coordinates in the texture image go from 0 
to 1. Finally, we clear the texture so it doesn't interfere with what we want to draw later. 

Now we are ready to place objects in the scene. We will use the “hello world” computer 
graphics example, the Utah teapot (http://en.wikipedia.org/wiki/Utah_teapot). 
This teapot has a rich history and is available as one of the standard shapes in GLUT: 

from OpenGL.GLUT import * 
glutSolidTeapot(size) 

This generates a solid teapot model of relative size size. 

The following function will set up the color and properties to make a pretty red teapot: 

def draw_teapot(size):
    """ Draw a red teapot at the origin. """
    glEnable(GL_LIGHTING)
    glEnable(GL_LIGHT0)
    glEnable(GL_DEPTH_TEST)
    glClear(GL_DEPTH_BUFFER_BIT)

    # draw red teapot
    glMaterialfv(GL_FRONT,GL_AMBIENT,[0,0,0,0])
    glMaterialfv(GL_FRONT,GL_DIFFUSE,[0.5,0.0,0.0,0.0])
    glMaterialfv(GL_FRONT,GL_SPECULAR,[0.7,0.6,0.6,0.0])
    glMaterialf(GL_FRONT,GL_SHININESS,0.25*128.0)
    glutSolidTeapot(size)

The first two lines enable lighting and a light. Lights are numbered as GL_LIGHT0, 
GL_LIGHT1, etc. We will only use one light in this example. The glEnable() function is 
used to turn on OpenGL features. These are defined with uppercase constants. Turning 
off a feature is done with the corresponding function glDisable(). Next, depth testing 
is turned on so that objects are rendered according to their depth (so that far-away 
objects are not drawn in front of near objects) and the depth buffer is cleared. Next, the 
material properties of the object, such as the diffuse and specular colors, are specified. 
The last line adds a solid Utah teapot with the specified material properties. 


Tying It All Together 

The full script for generating an image like the one in Figure 4-5 looks like this 
(assuming that you also have the functions introduced above in the same file): 

from OpenGL.GL import *
from OpenGL.GLU import *
from OpenGL.GLUT import *
import pygame, pygame.image
from pygame.locals import *
from numpy import * # array, dot, eye, linalg, arctan, pi used by the functions above
import pickle

width,height = 1000,747

def setup():
    """ Setup window and pygame environment. """
    pygame.init()
    pygame.display.set_mode((width,height),OPENGL | DOUBLEBUF)
    pygame.display.set_caption('OpenGL AR demo')

# load camera data (binary mode is required for pickle under Python 3)
with open('ar_camera.pkl','rb') as f:
    K = pickle.load(f)
    Rt = pickle.load(f)

setup()
draw_background('book_perspective.bmp')
set_projection_from_camera(K)
set_modelview_from_camera(Rt)
draw_teapot(0.02)

while True:
    event = pygame.event.poll()
    if event.type in (QUIT,KEYDOWN):
        break
    pygame.display.flip()

First, this script loads the camera calibration matrix and the rotation and translation 
part of the camera matrix using Pickle. This assumes that you saved them as described 
on page 89. The setup() function initializes PyGame, sets the window to the size of 
the image, and makes the drawing area a double buffer OpenGL window. Next, the 
background image is loaded and placed to fit the window. The camera and model view 
matrices are set and finally the teapot is drawn at the correct position. 

Events in PyGame are handled using infinite loops with regular polling for any changes. 
These can be keyboard, mouse, or other events. In this case, we check if the application 
was quit or if a key was pressed and exit the loop. The command pygame.display.flip() 
draws the objects on the screen. 


Figure 4-5. Augmented reality. Placing a computer graphics model on a book in a scene using camera 
parameters computed from feature matches: the Utah teapot rendered in place aligned with the 
coordinate axes (top); sanity check to see the position of the origin (bottom). 

The result should look like Figure 4-5. As you can see, the orientation is correct (the 
teapot is aligned with the sides of the cube in Figure 4-4). To check that the placement 
is correct, you can try to make the teapot really small by passing a smaller value for the 
size variable. The teapot should be placed close to the [0,0,0] corner of the cube in 
Figure 4-4. An example is shown in Figure 4-5. 


Loading Models 

Before we end this chapter, we will touch upon one last detail: loading 3D models and 
displaying them. The PyGame cookbook has a script for loading models in .obj format 
available at http://www.pygame.org/wiki/OBJFileLoader. You can learn more about the 
.obj format and the corresponding material file format at http://en.wikipedia.org/wiki/ 
Wavefront_.obj_file. 

Let's see how to use that with a basic example. We will use a freely available toy plane 
model from http://www.oyonale.com/modeles.php.^ Download the .obj version and save 
it as toyplane.obj. You can, of course, replace this model with any model of your choice; 
the code below will be the same. 

Assuming that you downloaded the file as objloader.py, add the following function to 
the file you used for the teapot example above: 

def load_and_draw_model(filename):
    """ Loads a model from an .obj file using objloader.py.
      Assumes there is a .mtl material file with the same name. """
    glEnable(GL_LIGHTING)
    glEnable(GL_LIGHT0)
    glEnable(GL_DEPTH_TEST)
    glClear(GL_DEPTH_BUFFER_BIT)

    # set model color
    glMaterialfv(GL_FRONT,GL_AMBIENT,[0,0,0,0])
    glMaterialfv(GL_FRONT,GL_DIFFUSE,[0.5,0.75,1.0,0.0])
    glMaterialf(GL_FRONT,GL_SHININESS,0.25*128.0)

    # load from a file
    import objloader
    obj = objloader.OBJ(filename,swapyz=True)
    glCallList(obj.gl_list)

Same as before, we set the lighting and the color properties of the model. Next, we load 
a model file into an OBJ object and execute the OpenGL calls from the file. 

You can set the texture and material properties in a corresponding .mtl file. The 
objloader module actually requires a material file. Rather than modifying the loading 
script, we take the pragmatic approach of just creating a tiny material file. In this case, 
we’ll just specify the color. 


^ Models courtesy of Gilles Tran (Creative Commons License By Attribution). 

Create a file toyplane.mtl with the following lines: 


newmtl lightblue 
Kd 0.5 0.75 1.0 

illum 1 

This sets the diffuse color of the object to a light grayish blue. Now, make sure to replace 
the “usemtl” tag in your .obj file with 

usemtl lightblue 

Adding textures we leave to the exercises. Replacing the call to draw_teapot() in the 
example above with 

load_and_draw_model('toyplane.obj') 

should generate a window like the one shown in Figure 4-6. 
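
If you switch models often, the two manual edits described above (creating the material file and changing the "usemtl" tag) can be scripted. The sketch below works on plain strings so it is easy to try; the helper names material_text() and use_material() and the three-line sample .obj text are made up for this example, and a real .obj file will contain many more lines.

```python
def material_text(name, kd):
    """Text of a minimal .mtl file with one diffuse color."""
    return "newmtl %s\nKd %s %s %s\nillum 1\n" % ((name,) + tuple(kd))

def use_material(obj_text, name):
    """Point every usemtl line in the .obj text at the given material."""
    lines = []
    for line in obj_text.splitlines():
        if line.startswith("usemtl"):
            line = "usemtl " + name
        lines.append(line)
    return "\n".join(lines) + "\n"

# hypothetical sample data
mtl = material_text("lightblue", (0.5, 0.75, 1.0))
obj = use_material("mtllib toyplane.mtl\nusemtl default\nv 0 0 0\n", "lightblue")
```

Writing mtl to toyplane.mtl and the modified text back to toyplane.obj gives the same files as editing them by hand.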

This is as deep as we will go into augmented reality and OpenGL in this book. With 
the recipe for calibrating cameras, computing camera pose, translating the cameras into 
OpenGL format, and rendering models in the scene, the groundwork is laid for you to 
continue exploring augmented reality. In the next chapter, we will continue with the 
camera model and compute 3D structure and camera pose without the use of markers. 



Figure 4-6. Loading a 3D model from an .obj file and placing it on a book in a scene using camera 
parameters computed from feature matches. 

Exercises 

1. Modify the example code for the motion in Figure 4-2 to transform the points 
instead of the camera. You should get the same plot. Experiment with different 
transformations and plot the results. 

2. Some of the Oxford multi-view datasets have camera matrices given. Compute the 
camera positions for one of the sets and plot the camera path. Does it match with 
what you are seeing in the images? 

3. Take some images of a scene with a planar marker or object. Match features to a 
full-frontal image to compute the pose of each image's camera location. Plot the 
camera trajectory and the plane of the marker. Add the feature points if you like. 

4. In our augmented reality example, we assumed the object to be placed at the 
origin and applied only the camera's position to the model view matrix. Modify 
the example to place several objects at different locations by adding the object 
transformation to the matrix. For example, place a grid of teapots on the marker. 

5. Take a look at the online documentation for .obj model files and see how to use 
textured models. Find a model (or create your own) and add it to the scene. 

CHAPTER 5 


Multiple View Geometry 


This chapter will show you how to handle multiple views and how to use the geometric 
relationships between them to recover camera positions and 3D structure. With images 
taken at different view points, it is possible to compute 3D scene points as well as 
camera locations from feature matches. We introduce the necessary tools and show 
a complete 3D reconstruction example. The last part of the chapter shows how to 
compute dense depth reconstructions from stereo images. 

5.1 Epipolar Geometry 

Multiple view geometry is the field studying the relationship between cameras and 
features when there are correspondences between many images that are taken from 
varying viewpoints. The image features are usually interest points, and we will focus 
on that case throughout this chapter. The most important constellation is two-view 
geometry. 

With two views of a scene and corresponding points in these views, there are geometric 
constraints on the image points as a result of the relative orientation of the cameras, 
the properties of the cameras, and the position of the 3D points. These geometric 
relationships are described by what is called epipolar geometry. This section will give 
a very short description of the basic components we need. For more details on the 
subject, see [13]. 

Without any prior knowledge of the cameras, there is an inherent ambiguity in that a 
3D point, X, transformed with an arbitrary (4 x 4) homography H as HX will have the 
same image point in a camera PH^{-1} as the original point in the camera P. Expressed 
with the camera equation, this is 

\[ \lambda x = P X = P H^{-1} H X = \hat{P} \hat{X}. \]
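
This ambiguity is easy to confirm numerically. The following sketch uses randomly generated values (an arbitrary camera, point, and homography, all assumptions for the sake of the example) and checks that the transformed camera and the transformed point produce the same image point.

```python
import numpy as np

rng = np.random.RandomState(1)
P = rng.randn(3, 4)                      # an arbitrary camera matrix
X = np.append(rng.randn(3), 1.0)         # a homogeneous 3D point
H = rng.randn(4, 4) + 4 * np.eye(4)      # a 4x4 homography (shifted to keep it well conditioned)

P_hat = np.dot(P, np.linalg.inv(H))      # transformed camera
X_hat = np.dot(H, X)                     # transformed point

x = np.dot(P, X)                         # original projection
x_hat = np.dot(P_hat, X_hat)             # projection after the transformation
```

The two projections agree up to rounding errors, which is why image measurements alone cannot distinguish between the two configurations.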

Because of this ambiguity, when analyzing two view geometry we can always transform 
the cameras with a homography to simplify matters. Often this homography is just a 
rigid transformation to change the coordinate system. A good choice is to set the origin 
and coordinate axes to align with the first camera so that 

\[ P_1 = K_1 [I | 0] \quad \text{and} \quad P_2 = K_2 [R | t]. \]

Here we use the same notation as in Chapter 4; K_1 and K_2 are the calibration matrices, 
R is the rotation of the second camera, and t is the translation of the second camera. 
Using these camera matrices, one can derive a condition for the projection of a point X 
to image points x_1 and x_2 (with P_1 and P_2, respectively). This condition is what makes 
it possible to recover the camera matrices from corresponding image points. 

The following equation must be satisfied: 

\[ x_2^T F x_1 = 0, \tag{5.1} \]

where 

\[ F = K_2^{-T} S_t R K_1^{-1} \]

and the matrix S_t is the skew symmetric matrix 

\[ S_t = \begin{bmatrix} 0 & -t_3 & t_2 \\ t_3 & 0 & -t_1 \\ -t_2 & t_1 & 0 \end{bmatrix}. \tag{5.2} \]

Equation (5.1) is called the epipolar constraint. The matrix F in the epipolar constraint 
is called the fundamental matrix and as you can see, it is expressed in components of 
the two camera matrices (their relative rotation R and translation t). The fundamental 
matrix has rank 2 and det(F) = 0. This will be used in algorithms for estimating F. 

The equations above mean that the camera matrices can be recovered from F, which 
in turn can be computed from point correspondences, as we will see later. Without 
knowing the internal calibration (K_1 and K_2), the camera matrices are only recoverable 
up to a projective transformation. With known calibration, the reconstruction will be 
metric. A metric reconstruction is a 3D reconstruction that correctly represents distances 
and angles.^ 
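
To make the epipolar constraint concrete, here is a small sketch that builds F from equation (5.2) and the expression for F above, and then tests equation (5.1) on a synthetic point. The calibration, rotation, translation, and 3D point are made-up values, and the helper names skew() and fundamental_from_cameras() are hypothetical.

```python
import numpy as np

def skew(t):
    """Skew symmetric matrix S_t of equation (5.2)."""
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

def fundamental_from_cameras(K1, K2, R, t):
    """F = K2^-T S_t R K1^-1 for P1 = K1[I|0] and P2 = K2[R|t]."""
    return np.dot(np.linalg.inv(K2).T,
                  np.dot(skew(t), np.dot(R, np.linalg.inv(K1))))

# hypothetical two-camera setup
K1 = K2 = np.array([[800.0, 0, 320], [0, 800, 240], [0, 0, 1]])
a = 0.1
R = np.array([[np.cos(a), 0, np.sin(a)],
              [0, 1, 0],
              [-np.sin(a), 0, np.cos(a)]])
t = np.array([1.0, 0.2, 0.1])
F = fundamental_from_cameras(K1, K2, R, t)

# project one 3D point into both views and evaluate x2^T F x1
X = np.array([0.5, -0.3, 4.0])
x1 = np.dot(K1, X)
x2 = np.dot(K2, np.dot(R, X) + t)
residual = np.dot(x2, np.dot(F, x1))
```

The residual is zero up to rounding errors, and the determinant of F vanishes as expected for a rank 2 matrix.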

There is one final piece of geometry needed before we can proceed to actually using 
this theory on some image data. Given a point in one of the images, for example x_2 in 
the second view, equation (5.1) defines a line in the first image since 

\[ x_2^T F x_1 = l_1^T x_1 = 0. \]

The equation l_1^T x_1 = 0 determines a line with all points x_1 in the first image satisfying 
the equation belonging to the line. This line is called an epipolar line corresponding 
to the point x_2. This means that a corresponding point to x_2 must lie on this line. The 
fundamental matrix can therefore help the search for correspondences by restricting 
the search to this line. 

^ The absolute scale of the reconstruction cannot be recovered, but that is rarely a problem. 

Figure 5-1. An illustration of epipolar geometry. A 3D point X is projected to x_1 and x_2, in the two 
views, respectively. The baseline between the two camera centers, C_1 and C_2, intersects the image planes 
in the epipoles, e_1 and e_2. The lines l_1 and l_2 are called epipolar lines. 


The epipolar lines all meet in a point, e, called the epipole. The epipole is actually 
the image point corresponding to the projection of the other camera center. This point 
can be outside the actual image, depending on the relative orientation of the cameras. 
Since the epipole lies on all epipolar lines, it must satisfy F e_1 = 0. It can, therefore, 
be computed as the null vector of F, as we will see later. The other epipole can be 
computed from the relation e_2^T F = 0. The epipoles and the epipolar lines are illustrated 
in Figure 5-1. 
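
The claim that the epipole is the image of the other camera center can be verified on synthetic cameras. In this sketch the cameras are P_1 = K[I | 0] and P_2 = K[R | t] with made-up values, F follows the text convention of equation (5.1), and null_vector() is a hypothetical helper name.

```python
import numpy as np

def null_vector(F):
    """Right null vector of F: the last row of V from the SVD."""
    U, S, V = np.linalg.svd(F)
    return V[-1]

# hypothetical setup with P1 = K[I|0] and P2 = K[R|t]
K = np.array([[800.0, 0, 320], [0, 800, 240], [0, 0, 1]])
a = 0.1
R = np.array([[np.cos(a), 0, np.sin(a)],
              [0, 1, 0],
              [-np.sin(a), 0, np.cos(a)]])
t = np.array([1.0, 0.2, 0.1])
St = np.array([[0, -t[2], t[1]], [t[2], 0, -t[0]], [-t[1], t[0], 0]])
Kinv = np.linalg.inv(K)
F = np.dot(Kinv.T, np.dot(St, np.dot(R, Kinv)))

e1 = null_vector(F)      # satisfies F e1 = 0
e2 = null_vector(F.T)    # satisfies e2^T F = 0

# the center of camera 2 is C2 = -R^T t; its image in view 1 should be e1
c2_in_view1 = np.dot(K, -np.dot(R.T, t))
e1n = e1 / e1[2]
c2n = c2_in_view1 / c2_in_view1[2]
```

After normalizing the last coordinate to one, e1n and c2n coincide, which matches the geometric description above.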


A Sample Data Set 

In the coming sections, we will need a data set with image points, 3D points, and camera 
matrices to experiment with and illustrate the algorithms. We will use one of the sets 
from the Oxford multi-view datasets available at http://www.robots.ox.ac.uk/~vgg/data/data-mview.html. 
Download the zipped file for the Merton1 data. The following script 
will load all the data for you: 

from PIL import Image
from numpy import *
import camera

# load some images
im1 = array(Image.open('images/001.jpg'))
im2 = array(Image.open('images/002.jpg'))

# load 2D points for each view to a list
points2D = [loadtxt('2D/00'+str(i+1)+'.corners').T for i in range(3)]

# load 3D points
points3D = loadtxt('3D/p3d').T

# load correspondences
# (older NumPy versions also accepted the keyword missing='*')
corr = genfromtxt('2D/nview-corners',dtype='int',missing_values='*')

# load cameras to a list of Camera objects
P = [camera.Camera(loadtxt('2D/00'+str(i+1)+'.P')) for i in range(3)]

This will load the first two images (out of three), all the image feature points^ for the 
three views, the reconstructed 3D points corresponding to the image points, which 
points correspond across views, and finally the camera matrices (where we used the 
Camera class from the previous chapter). Here we used loadtxt() to read the text files 
to NumPy arrays. The correspondences contain missing data, since not all points are 
visible or successfully matched in all views. The correspondences need to be loaded 
with this taken into account. The function genfromtxt() solves this by replacing the 
missing values (denoted with '*' in this file) with -1. 

A convenient way of running this script and getting all the data is to save the code 
above in a file, for example load_vggdata.py, and use the command execfile() at the 
beginning of your scripts or experiments: 

execfile('load_vggdata.py') 

Let's see what this data looks like. Try to project the 3D points into one view and 
compare the results with the observed image points: 

# make 3D points homogeneous and project
X = vstack( (points3D,ones(points3D.shape[1])) )
x = P[0].project(X)

# plotting the points in view 1
figure()
imshow(im1)
plot(points2D[0][0],points2D[0][1],'*')
axis('off')

figure()
imshow(im1)
plot(x[0],x[1],'r.')
axis('off')

show()

This creates a plot with the first view and image points in that view; for comparison, the 
projected points are shown in a separate figure. Figure 5-2 shows the resulting plots. If 
you look closely, you will see that the second plot with the projected 3D points contains 
more points than the first. These are image feature points reconstructed from views 2 
and 3 but not detected in view 1. 
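
If you want to see what the projection step does without the camera module, the core of it is a matrix product followed by a normalization. This is a minimal sketch with a made-up camera and two made-up points; the standalone function project() here is an illustration and not the Camera class from the previous chapter.

```python
import numpy as np

def project(P, X):
    """Project homogeneous 3D points X (4*n) with camera matrix P (3*4)
    and normalize so that the last row is 1."""
    x = np.dot(P, X)
    return x / x[2]

# hypothetical camera looking down the z-axis, and two points in front of it
P = np.hstack((np.eye(3), np.zeros((3, 1))))
points3D = np.array([[0.0, 1.0],
                     [0.0, 2.0],
                     [2.0, 4.0]])        # one column per point
X = np.vstack((points3D, np.ones(points3D.shape[1])))
x = project(P, X)
```

The first point lies on the optical axis and projects to the origin; the second projects to (0.25, 0.5) after division by its depth.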


Plotting 3D Data with Matplotlib 

To visualize our 3D reconstructions, we need to be able to plot in 3D. The mplot3d 
toolkit in Matplotlib provides 3D plotting of points, lines, contours, surfaces and most 
other basic plotting components as well as 3D rotation and scaling from the controls 
of the figure window. 


^ Actually Harris corner points; see Section 2.1. 

Figure 5-2. The Merton1 data set from the Oxford multi-view datasets: view 1 with image points 
shown (left); view 1 with projected 3D points (right). 


Making a plot in 3D is done by adding the projection="3d" keyword to the axes object 
like this: 

from mpl_toolkits.mplot3d import axes3d

fig = figure()
# in recent Matplotlib releases, use fig.add_subplot(111, projection="3d") instead
ax = fig.gca(projection="3d")

# generate 3D sample data
X,Y,Z = axes3d.get_test_data(0.25)

# plot the points in 3D
ax.plot(X.flatten(),Y.flatten(),Z.flatten(),'o')

show()

The function get_test_data() generates sample points on a regular x, y grid with the 
parameter determining the spacing. Flattening these grids gives three lists of points that 
can be sent to plot(). This should plot 3D points on what looks like a surface. Try it 
out and see for yourself. 

Now we can plot the Merton sample data to see what the 3D points look like: 

# plotting 3D points
from mpl_toolkits.mplot3d import axes3d

fig = figure()
ax = fig.gca(projection='3d')
ax.plot(points3D[0],points3D[1],points3D[2],'k.')

Figure 5-3 shows the 3D points from three different views. The figure window and 
controls look like the standard plot windows for images and 2D data with an additional 
3D rotation tool. 

Computing F—The Eight Point Algorithm 

The eight point algorithm is an algorithm for computing the fundamental matrix from 
point correspondences. Here's a brief description; the details can be found in [14] 
and [13]. 

Figure 5-3. The 3D points of the Merton1 data set from the Oxford multi-view datasets shown using 
Matplotlib; view from above and to the side (left); view from the top, showing the building walls and 
points on the roof (middle); side view showing the profile of one of the walls and a frontal view of 
points on the other wall (right). 


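The epipolar constraint (5.1) can be written as a linear system like 

\[
\begin{bmatrix}
x_2^1 x_1^1 & x_2^1 y_1^1 & x_2^1 w_1^1 & \cdots & w_2^1 w_1^1 \\
x_2^2 x_1^2 & x_2^2 y_1^2 & x_2^2 w_1^2 & \cdots & w_2^2 w_1^2 \\
\vdots & \vdots & \vdots & & \vdots \\
x_2^n x_1^n & x_2^n y_1^n & x_2^n w_1^n & \cdots & w_2^n w_1^n
\end{bmatrix}
\begin{bmatrix} F_{11} \\ F_{12} \\ F_{13} \\ \vdots \\ F_{33} \end{bmatrix}
= A f = 0,
\]

where f contains the elements of F, \mathbf{x}_1^i = [x_1^i, y_1^i, w_1^i] and \mathbf{x}_2^i = [x_2^i, y_2^i, w_2^i] is a 
correspondence pair, and there are n point correspondences in total. The fundamental 
matrix has nine elements, but since the scale is arbitrary, only eight equations are 
needed. Eight point correspondences are therefore needed to compute F; hence the 
name of the algorithm. 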
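
One way to convince yourself that the constraint really is linear in the elements of F is to build a single row of A and compare it with the direct evaluation of equation (5.1). The sketch below assumes that f holds the elements of F row by row and uses a random matrix in place of a real fundamental matrix; constraint_row() is a hypothetical helper name.

```python
import numpy as np

def constraint_row(x1, x2):
    """Row of A for one correspondence so that row . f = x2^T F x1,
    with f = F.flatten() (elements of F taken row by row)."""
    return np.kron(x2, x1)

rng = np.random.RandomState(2)
F = rng.randn(3, 3)                 # stand-in for a fundamental matrix
x1 = np.array([10.0, 20.0, 1.0])    # made-up homogeneous image points
x2 = np.array([12.0, 19.0, 1.0])

lhs = np.dot(constraint_row(x1, x2), F.flatten())
rhs = np.dot(x2, np.dot(F, x1))
```

Swapping the two arguments of constraint_row() gives the row for the transposed convention x_1^T F x_2 = 0, so the order of the products determines which of F and F^T the solution represents.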

Create a file sfm.py, and add the following function for the eight point algorithm that 
minimizes ||Af||: 

from pylab import *
from numpy import *

def compute_fundamental(x1,x2):
    """ Computes the fundamental matrix from corresponding points
      (x1,x2 3*n arrays) using the 8 point algorithm (no normalization here).
      each row is constructed as
      [x'*x, x'*y, x', y'*x, y'*y, y', x, y, 1] """

    n = x1.shape[1]
    if x2.shape[1] != n:
        raise ValueError("Number of points don't match.")

    # build matrix for equations
    # note: with this ordering of the products, the solution satisfies x1^T F x2 = 0
    A = zeros((n,9))
    for i in range(n):
        A[i] = [x1[0,i]*x2[0,i], x1[0,i]*x2[1,i], x1[0,i]*x2[2,i],
                x1[1,i]*x2[0,i], x1[1,i]*x2[1,i], x1[1,i]*x2[2,i],
                x1[2,i]*x2[0,i], x1[2,i]*x2[1,i], x1[2,i]*x2[2,i] ]

    # compute linear least square solution
    U,S,V = linalg.svd(A)
    F = V[-1].reshape(3,3)

    # constrain F
    # make rank 2 by zeroing out last singular value
    U,S,V = linalg.svd(F)
    S[2] = 0
    F = dot(U,dot(diag(S),V))

    return F

As usual, we compute the least squares solution using SVD. Since the resulting solution 
might not have rank 2 as a proper fundamental matrix should, we replace the result 
with the closest rank 2 approximation by zeroing out the last singular value. This is 
a standard trick and a useful one to know. The function ignores the important step 
of normalizing the image coordinates. Ignoring normalization could give numerical 
problems. Let's leave that for later. 
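
The rank 2 trick is worth trying on its own. This sketch applies it to a random 3 x 3 matrix (a stand-in, not a real fundamental matrix) and compares the singular values before and after; closest_rank2() is a hypothetical helper name.

```python
import numpy as np

def closest_rank2(F):
    """Closest rank 2 matrix to F, obtained by zeroing the smallest singular value."""
    U, S, V = np.linalg.svd(F)
    S[2] = 0
    return np.dot(U, np.dot(np.diag(S), V))

rng = np.random.RandomState(3)
F_full = rng.randn(3, 3)
F2 = closest_rank2(F_full)

s_before = np.linalg.svd(F_full, compute_uv=False)
s_after = np.linalg.svd(F2, compute_uv=False)
distance = np.linalg.norm(F_full - F2)   # Frobenius norm of the change
```

The two largest singular values are unchanged, the third becomes zero, and the size of the change equals the singular value that was removed, which is what "closest" means here.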

The Epipole and Epipolar Lines 

As mentioned at the start of this section, the epipole satisfies F e_1 = 0 and can be 
computed from the null space of F. Add this function to sfm.py. 

def compute_epipole(F):
    """ Computes the (right) epipole from a
      fundamental matrix F.
      (Use with F.T for left epipole.) """

    # return null space of F (Fx=0)
    U,S,V = linalg.svd(F)
    e = V[-1]
    return e/e[2]

If you want the epipole corresponding to the left null vector (corresponding to the 
epipole in the other image), just transpose F before passing it as input. 

We can try these two functions on the first two views of our sample data set like this: 

import sfm

# index for points in first two views
ndx = (corr[:,0]>=0) & (corr[:,1]>=0)

# get coordinates and make homogeneous
x1 = points2D[0][:,corr[ndx,0]]
x1 = vstack( (x1,ones(x1.shape[1])) )
x2 = points2D[1][:,corr[ndx,1]]
x2 = vstack( (x2,ones(x2.shape[1])) )

# compute F
F = sfm.compute_fundamental(x1,x2)

# compute the epipole
e = sfm.compute_epipole(F)

# plotting
figure()
imshow(im1)
# plot each line individually, this gives nice colors
for i in range(5):
    sfm.plot_epipolar_line(im1,F,x2[:,i],e,False)
axis('off')

figure()
imshow(im2)
# plot each point individually, this gives same colors as the lines
for i in range(5):
    plot(x2[0,i],x2[1,i],'o')
axis('off')

show()


First, the points that are in correspondence between the two images are selected and 
made into homogeneous coordinates. Here we just read them from a text file; in 
reality these would be the result of extracting features and matching them as we did 
in Chapter 2. The missing values in the correspondence list corr are -1, so picking 
indices greater than or equal to zero gives the points visible in each view. The two conditions 
are combined with the array operator &. 
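
The index selection is easier to follow on a tiny example. The arrays below are made up (four 3D points, two views, three detected features per view) and only mimic the layout of corr and points2D.

```python
import numpy as np

# hypothetical correspondence table: one row per 3D point, one column per view,
# -1 where the point was not matched in that view
corr = np.array([[0, 1],
                 [1, -1],
                 [-1, 0],
                 [2, 2]])

# hypothetical 2D points, one column per feature (2*n arrays)
pts_view1 = np.array([[10.0, 11.0, 12.0], [20.0, 21.0, 22.0]])
pts_view2 = np.array([[30.0, 31.0, 32.0], [40.0, 41.0, 42.0]])

# keep only the rows that have a match in both views
ndx = (corr[:, 0] >= 0) & (corr[:, 1] >= 0)

# look up the matched features and make them homogeneous
x1 = pts_view1[:, corr[ndx, 0]]
x2 = pts_view2[:, corr[ndx, 1]]
x1 = np.vstack((x1, np.ones(x1.shape[1])))
x2 = np.vstack((x2, np.ones(x2.shape[1])))
```

Only the first and last rows of corr survive, so x1 and x2 each hold two columns, and column i of x1 corresponds to column i of x2.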

Finally, the first five of the epipolar lines are shown in the first view and the corresponding 
matching points in view 2. Here we used the helper plot function: 


def plot_epipolar_line(im,F,x,epipole=None,show_epipole=True):
    """ Plot the epipole and epipolar line F*x=0
      in an image. F is the fundamental matrix
      and x a point in the other image."""

    m,n = im.shape[:2]
    line = dot(F,x)

    # epipolar line parameter and values
    t = linspace(0,n,100)
    lt = array([(line[2]+line[0]*tt)/(-line[1]) for tt in t])

    # take only line points inside the image
    ndx = (lt>=0) & (lt<m)
    plot(t[ndx],lt[ndx],linewidth=2)

    if show_epipole:
        if epipole is None:
            epipole = compute_epipole(F)
        plot(epipole[0]/epipole[2],epipole[1]/epipole[2],'r*')


This function parameterizes the line with the range of the x axis and removes parts 
of lines above and below the image border. If the last parameter show_epipole is true, 
the epipole will be plotted as well (and computed if not passed as input). The plots are 
shown in Figure 5-4. The color coding matches between the plots so you can see that 
the corresponding point in one image lies somewhere along the same-color line as a 
point in the other image. 

Figure 5-4. Epipolar lines in view 1 shown for five points in view 2 of the Merton1 data. The bottom 
row shows a closeup of the area around the points. The lines can be seen to converge on a point outside 
the image to the left. The lines show where point correspondences can be found in the other image (the 
color coding matches between lines and points). 


5.2 Computing with Cameras and 3D Structure 

The previous section covered relationships between views and how to compute the 
fundamental matrix and epipolar lines. Here we briefly explain the tools we need for 
computing with cameras and 3D structure. 


Triangulation 

Given known camera matrices, a set of point correspondences can be triangulated to 
recover the 3D positions of these points. The basic algorithm is fairly simple. 

For two views with camera matrices $P_1$ and $P_2$, each with a projection $\mathbf{x}_1$ and $\mathbf{x}_2$ of the
same 3D point $\mathbf{X}$ (all in homogeneous coordinates), the camera equation (4.1) gives the
relation


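$$
\begin{bmatrix} P_1 & -\mathbf{x}_1 & 0 \\ P_2 & 0 & -\mathbf{x}_2 \end{bmatrix}
\begin{bmatrix} \mathbf{X} \\ \lambda_1 \\ \lambda_2 \end{bmatrix} = 0.
$$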


There might not be an exact solution to these equations due to image noise, errors in 
the camera matrices, or other sources of errors. Using SVD, we can get a least squares 
estimate of the 3D point. 











Add the following function that computes the least squares triangulation of a point pair 
to sfm.py. 

def triangulate_point(x1,x2,P1,P2):
    """ Point pair triangulation from
        least squares solution. """

    M = zeros((6,6))
    M[:3,:4] = P1
    M[3:,:4] = P2
    M[:3,4] = -x1
    M[3:,5] = -x2

    U,S,V = linalg.svd(M)
    X = V[-1,:4]

    return X / X[3]

The first four values in the last row of V (the right singular vector belonging to the
smallest singular value) are the 3D coordinates in homogeneous coordinates. To triangulate
many points, we can add the following convenience function:

def triangulate(x1,x2,P1,P2):
    """ Two-view triangulation of points in
        x1,x2 (3*n homog. coordinates). """

    n = x1.shape[1]
    if x2.shape[1] != n:
        raise ValueError("Number of points don't match.")

    X = [ triangulate_point(x1[:,i],x2[:,i],P1,P2) for i in range(n)]
    return array(X).T

This function takes two arrays of points and returns an array of 3D coordinates. 
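
Before trying real data, it can help to check the triangulation on a synthetic example where the answer is known. The sketch below is a standalone version written for Python 3 with explicit NumPy imports; the two camera matrices and the 3D point are made-up values chosen only for illustration.

```python
import numpy as np

def triangulate_point(x1, x2, P1, P2):
    """Least squares triangulation of one point pair (same construction as above)."""
    M = np.zeros((6, 6))
    M[:3, :4] = P1
    M[3:, :4] = P2
    M[:3, 4] = -x1
    M[3:, 5] = -x2
    U, S, Vt = np.linalg.svd(M)
    X = Vt[-1, :4]
    return X / X[3]

# two hypothetical cameras: one at the origin, one shifted along the x axis
P1 = np.hstack((np.eye(3), np.zeros((3, 1))))
P2 = np.hstack((np.eye(3), np.array([[-1.0], [0.0], [0.0]])))

# project a known 3D point into both views (noise free)
X_true = np.array([0.5, -0.2, 4.0, 1.0])
x1 = P1 @ X_true
x1 = x1 / x1[2]
x2 = P2 @ X_true
x2 = x2 / x2[2]

X_est = triangulate_point(x1, x2, P1, P2)
```

With noise-free projections the recovered point should agree with X_true up to numerical precision; adding noise to x1 and x2 is a simple way to see how the least squares estimate degrades.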

Try the triangulation on the Merton1 data like this:

import sfm

# index for points in first two views
ndx = (corr[:,0]>=0) & (corr[:,1]>=0)

# get coordinates and make homogeneous
x1 = points2D[0][:,corr[ndx,0]]
x1 = vstack( (x1,ones(x1.shape[1])) )
x2 = points2D[1][:,corr[ndx,1]]
x2 = vstack( (x2,ones(x2.shape[1])) )

Xtrue = points3D[:,ndx]
Xtrue = vstack( (Xtrue,ones(Xtrue.shape[1])) )

# check first 3 points
Xest = sfm.triangulate(x1,x2,P[0].P,P[1].P)
print(Xest[:,:3])
print(Xtrue[:,:3])

# plotting
from mpl_toolkits.mplot3d import axes3d

fig = figure()
ax = fig.gca(projection='3d')
ax.plot(Xest[0],Xest[1],Xest[2],'ko')
ax.plot(Xtrue[0],Xtrue[1],Xtrue[2],'r.')
axis('equal')

show()

This will triangulate the points in correspondence from the first two views and print 
out the coordinates of the first three points to the console before plotting the recovered 
3D points next to the true values. The printout looks like this: 

[[ 1.03743725  1.56125273  1.40720017]
 [-0.57574987 -0.55504127 -0.46523952]
 [ 3.44173797  3.44249282  7.53176488]
 [ 1.          1.          1.        ]]
[[ 1.0378863   1.5606923   1.4071907 ]
 [-0.54627892 -0.5211711  -0.46371818]
 [ 3.4601538   3.4636809   7.5323397 ]
 [ 1.          1.          1.        ]]

The estimated points are close enough. The plot looks like Figure 5-5; as you can see, 
the points match fairly well. 

Computing the Camera Matrix from 3D Points 

With known 3D points and their image projections, the camera matrix, P, can be 
computed using a direct linear transform approach. This is essentially the inverse 
problem to triangulation and is sometimes called camera resectioning. This way to 
recover the camera matrix is again a least squares approach. 



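Figure 5-5. Triangulated points using camera matrices and point correspondences. The estimated points
are shown with dark circles and the true points with light dots. View from above and to the side (left).
Closeup of the points from one of the building walls (right).

From the camera equation (4.1), each visible 3D point $\mathbf{X}_i$ (in homogeneous coordinates)
is projected to an image point $\mathbf{x}_i = [x_i, y_i, 1]$ as $\lambda_i \mathbf{x}_i = P\mathbf{X}_i$ and the corresponding
points satisfy the relation

$$
\begin{bmatrix}
\mathbf{X}_1^T & 0 & 0 & -x_1 & 0 & \cdots \\
0 & \mathbf{X}_1^T & 0 & -y_1 & 0 & \cdots \\
0 & 0 & \mathbf{X}_1^T & -1 & 0 & \cdots \\
\mathbf{X}_2^T & 0 & 0 & 0 & -x_2 & \cdots \\
0 & \mathbf{X}_2^T & 0 & 0 & -y_2 & \cdots \\
0 & 0 & \mathbf{X}_2^T & 0 & -1 & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix}
\begin{bmatrix} \mathbf{p}_1^T \\ \mathbf{p}_2^T \\ \mathbf{p}_3^T \\ \lambda_1 \\ \lambda_2 \\ \vdots \end{bmatrix} = 0,
$$

where $\mathbf{p}_1$, $\mathbf{p}_2$, and $\mathbf{p}_3$ are the three rows of P. This can be written more compactly as

$$ M\mathbf{v} = 0. $$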

The estimation of the camera matrix is then obtained using SVD. With the matrices
described above, the code is straightforward. Add the function below to sfm.py.

def compute_P(x,X):
    """ Compute camera matrix from pairs of
        2D-3D correspondences (in homog. coordinates). """

    n = x.shape[1]
    if X.shape[1] != n:
        raise ValueError("Number of points don't match.")

    # create matrix for DLT solution
    M = zeros((3*n,12+n))
    for i in range(n):
        M[3*i,0:4] = X[:,i]
        M[3*i+1,4:8] = X[:,i]
        M[3*i+2,8:12] = X[:,i]
        M[3*i:3*i+3,i+12] = -x[:,i]

    U,S,V = linalg.svd(M)

    return V[-1,:12].reshape((3,4))

This function takes the image points and 3D points and builds up the matrix M above.
The first 12 values of the last row of V (the right singular vector belonging to the
smallest singular value) are the elements of the camera matrix and are
returned after a reshaping operation.
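
A quick way to convince yourself that the resectioning works is to project random 3D points with a known camera and check that the same camera comes back. The following standalone sketch (Python 3, explicit NumPy imports) does this; the camera matrix P_true and the random points are arbitrary illustrative values, not taken from any data set.

```python
import numpy as np

def compute_P(x, X):
    """DLT estimate of the camera matrix (same construction as above)."""
    n = x.shape[1]
    M = np.zeros((3 * n, 12 + n))
    for i in range(n):
        M[3 * i, 0:4] = X[:, i]
        M[3 * i + 1, 4:8] = X[:, i]
        M[3 * i + 2, 8:12] = X[:, i]
        M[3 * i:3 * i + 3, i + 12] = -x[:, i]
    U, S, Vt = np.linalg.svd(M)
    return Vt[-1, :12].reshape((3, 4))

# hypothetical camera and ten random, non-planar 3D points in front of it
rng = np.random.RandomState(0)
P_true = np.array([[800., 0., 320., 10.],
                   [0., 800., 240., 20.],
                   [0., 0., 1., 5.]])
X = np.vstack((rng.uniform(-1, 1, (3, 10)), np.ones(10)))
x = P_true @ X
x = x / x[2]

# the estimate is only defined up to scale, so normalize before comparing
P_est = compute_P(x, X)
P_est = P_est / P_est[2, 3]
P_ref = P_true / P_true[2, 3]
```

Since the estimate is only determined up to a scale factor, both matrices are divided by their last element before comparison, just as in the printout used to check the real data.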

Again, let's try this on our sample data set. The following script will pick out the points
that are visible in the first view (using the missing values from the correspondence list),
make them into homogeneous coordinates, and estimate the camera matrix:

import sfm, camera

corr = corr[:,0] # view 1
ndx3D = where(corr>=0)[0] # missing values are -1
ndx2D = corr[ndx3D]

# select visible points and make homogeneous
x = points2D[0][:,ndx2D] # view 1
x = vstack( (x,ones(x.shape[1])) )
X = points3D[:,ndx3D]
X = vstack( (X,ones(X.shape[1])) )

# estimate P
Pest = camera.Camera(sfm.compute_P(x,X))

# compare!
print(Pest.P / Pest.P[2,3])
print(P[0].P / P[0].P[2,3])

xest = Pest.project(X)

# plotting
figure()
imshow(im1)
plot(x[0],x[1],'bo')
plot(xest[0],xest[1],'r.')
axis('off')

show()

To check the camera matrices, they are printed to the console in normalized form (by 
dividing with the last element). The printout looks like this: 


[[  1.06520794e+00  -5.23431275e+01   2.06902749e+01   5.08729305e+02]
 [ -5.05773115e+01  -1.33243276e+01  -1.47388537e+01   4.79178838e+02]
 [  3.05121915e-03  -3.19264684e-02  -3.43703738e-02   1.00000000e+00]]
[[  1.06774679e+00  -5.23448212e+01   2.06926980e+01   5.08764487e+02]
 [ -5.05834364e+01  -1.33201976e+01  -1.47406641e+01   4.79228998e+02]
 [  3.06792659e-03  -3.19008054e-02  -3.43665129e-02   1.00000000e+00]]


The top is the estimated camera matrix, and below is the one computed by the creators
of the data set. As you can see, they are almost identical. Last, the 3D points are
projected using the estimated camera and plotted. The result looks like Figure 5-6 with
the true points shown as circles and the estimated camera projection as dots.



Figure 5-6. Projected points in view 1 computed using an estimated camera matrix.

Computing the Camera Matrix from a Fundamental Matrix

In a two view scenario, the camera matrices can be recovered from the fundamental
matrix. Assuming the first camera matrix is normalized to $P_1 = [I \mid 0]$, the problem is
to find the second camera matrix $P_2$. There are two different cases, the uncalibrated
case and the calibrated case.

The uncalibrated case—projective reconstruction 

Without any knowledge of the camera's intrinsic parameters, the camera matrix can
only be retrieved up to a projective transformation. This means that if the camera pair
is used to reconstruct 3D points, the reconstruction is only accurate up to a projective
transformation (you can get any solution out of the whole range of projective scene
distortions). This means that angles and distances are not respected.

Therefore, in the uncalibrated case the second camera matrix can be chosen up to a
(3 × 3) projective transformation. A simple choice is

$$ P_2 = [S_e F \mid \mathbf{e}], $$

where $\mathbf{e}$ is the left epipole, $\mathbf{e}^T F = 0$, and $S_e$ a skew matrix as in equation (5.2). Remember,
a triangulation with this matrix will most likely give distortions, for example in the
form of skewed reconstructions.

Here is what it looks like in code: 

def compute_P_from_fundamental(F):
    """ Computes the second camera matrix (assuming P1 = [I 0])
        from a fundamental matrix. """

    e = compute_epipole(F.T) # left epipole
    Te = skew(e)
    return vstack((dot(Te,F.T).T,e)).T

We used the helper function skew() defined as:

def skew(a):
    """ Skew matrix A such that a x v = Av for any v. """
    return array([[0,-a[2],a[1]],[a[2],0,-a[0]],[-a[1],a[0],0]])

Add both these functions to the file sfm.py. 
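
The defining property of the skew matrix is easy to verify numerically. The short standalone sketch below (Python 3 with NumPy; the vectors a and v are arbitrary example values) compares the matrix product with NumPy's cross product.

```python
import numpy as np

def skew(a):
    """Skew matrix A such that a x v = A v for any v."""
    return np.array([[0, -a[2], a[1]],
                     [a[2], 0, -a[0]],
                     [-a[1], a[0], 0]])

a = np.array([1., 2., 3.])
v = np.array([-2., 0.5, 4.])

cross_via_matrix = skew(a) @ v   # matrix form of the cross product
cross_direct = np.cross(a, v)    # reference result
self_product = skew(a) @ a       # a x a, should vanish
```

The last line illustrates that a vector crossed with itself gives zero, so a is always in the null space of its own skew matrix.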

The calibrated case—metric reconstruction 

With known calibration, the reconstruction will be metric and preserve properties of 
Euclidean space (except for a global scale parameter). In terms of reconstructing a 3D 
scene, this calibrated case is the interesting one. 

With known calibration K, we can apply its inverse $K^{-1}$ to the image points, $\mathbf{x}_K = K^{-1}\mathbf{x}$,
so that the camera equation becomes

$$ \mathbf{x}_K = K^{-1}K[R \mid \mathbf{t}]\mathbf{X} = [R \mid \mathbf{t}]\mathbf{X}, $$

in the new image coordinates. The points in these new image coordinates satisfy the
same fundamental equation as before:

$$ \mathbf{x}_{K_2}^T F \mathbf{x}_{K_1} = 0. $$

The fundamental matrix for calibration-normalized coordinates is called the essential
matrix and is usually denoted E instead of F, to make it clear that this is the calibrated
case and the image coordinates are normalized.
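
As a small illustration of what the calibration normalization does to pixel coordinates, the standalone sketch below (Python 3 with NumPy) applies the inverse of a calibration matrix to two points. The numbers in K and the two pixel positions are assumed example values chosen so that the result is easy to read.

```python
import numpy as np

# hypothetical calibration: focal lengths 2394 and 2398 pixels, principal point (932, 628)
K = np.array([[2394., 0., 932.],
              [0., 2398., 628.],
              [0., 0., 1.]])

# two pixel positions in homogeneous coordinates (one per column):
# the principal point, and a point one focal length to its right
x = np.array([[932., 3326.],
              [628., 628.],
              [1., 1.]])

x_n = np.linalg.inv(K) @ x   # calibration-normalized coordinates
x_back = K @ x_n             # multiplying with K reverses the normalization
```

The principal point maps to the origin and the second point to (1, 0), so normalized coordinates measure positions in units of the focal length relative to the principal point.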

The camera matrices recovered from an essential matrix respect metric relationships
but there are four possible solutions. Only one of them has the scene in front of both
cameras, so it is easy to pick the right one.

Here is an algorithm for computing the four solutions (see [13] for the details). Add this
function to sfm.py.

def compute_P_from_essential(E):
    """ Computes the second camera matrix (assuming P1 = [I 0])
        from an essential matrix. Output is a list of four
        possible camera matrices. """

    # make sure E is rank 2
    U,S,V = svd(E)
    if det(dot(U,V))<0:
        V = -V
    E = dot(U,dot(diag([1,1,0]),V))

    # create matrices (Hartley p 258)
    Z = skew([0,0,-1])
    W = array([[0,-1,0],[1,0,0],[0,0,1]])

    # return all four solutions
    P2 = [vstack((dot(U,dot(W,V)).T,U[:,2])).T,
          vstack((dot(U,dot(W,V)).T,-U[:,2])).T,
          vstack((dot(U,dot(W.T,V)).T,U[:,2])).T,
          vstack((dot(U,dot(W.T,V)).T,-U[:,2])).T]

    return P2

First, this function makes sure the essential matrix is rank 2 (with two equal non-zero
singular values), then the four solutions are created according to the recipe in [13]. A
list with four camera matrices is returned. How to pick the right one, we leave to an
example later.
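
To see the four-fold ambiguity in action, we can build an essential matrix from a known relative pose and check that the true camera is among the four candidates. The standalone sketch below (Python 3 with NumPy) follows the same recipe; the helper name candidates_from_essential, the rotation angle, and the unit translation are illustrative assumptions.

```python
import numpy as np

def skew(a):
    return np.array([[0, -a[2], a[1]], [a[2], 0, -a[0]], [-a[1], a[0], 0]])

def candidates_from_essential(E):
    """Four candidate second cameras [R | t], assuming P1 = [I | 0]."""
    U, S, Vt = np.linalg.svd(E)
    if np.linalg.det(U @ Vt) < 0:
        Vt = -Vt
    W = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
    u3 = U[:, 2].reshape(3, 1)
    return [np.hstack((U @ W @ Vt, u3)),
            np.hstack((U @ W @ Vt, -u3)),
            np.hstack((U @ W.T @ Vt, u3)),
            np.hstack((U @ W.T @ Vt, -u3))]

# hypothetical relative pose: small rotation about the y axis, unit translation along x
angle = 0.1
R = np.array([[np.cos(angle), 0., np.sin(angle)],
              [0., 1., 0.],
              [-np.sin(angle), 0., np.cos(angle)]])
t = np.array([1.0, 0.0, 0.0])

E = skew(t) @ R                        # essential matrix of this pose
P_true = np.hstack((R, t.reshape(3, 1)))
candidates = candidates_from_essential(E)
errors = [np.linalg.norm(P - P_true) for P in candidates]
```

Exactly which list position holds the true camera depends on the SVD, so the sketch only looks at the smallest of the four errors. The translation is recovered up to scale, which is why a unit-length t is used here.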

This concludes all the theory needed to compute 3D reconstructions from a collection 
of images. 

5.3 Multiple View Reconstruction 

Let's look at how to use the concepts above to compute an actual 3D reconstruction
from a pair of images. Computing a 3D reconstruction like this is usually referred to
as structure from motion (SfM) since the motion of a camera (or cameras) gives you 3D
structure.

Assuming the camera has been calibrated, the steps are as follows: 

1. Detect feature points and match them between the two images. 

2. Compute the fundamental matrix from the matches.

3. Compute the camera matrices from the fundamental matrix.

4. Triangulate the 3D points. 

We have all the tools to do this, but we need a robust way to compute a fundamental 
matrix when the point correspondences between the images contain incorrect matches. 


Robust Fundamental Matrix Estimation 

Similar to when we needed a robust way to compute homographies (Section 3.3), we 
also need to be able to estimate a fundamental matrix when there is noise and incorrect 
matches. As before, we will use RANSAC, this time combined with the eight point 
algorithm. It should be mentioned that the eight point algorithm breaks down for 
planar scenes, so you cannot use it for scenes where the scene points are all on a 
plane. 

Add this class to sfm.py. 

class RansacModel(object):
    """ Class for fundamental matrix fit with ransac.py from
        http://www.scipy.org/Cookbook/RANSAC"""

    def __init__(self,debug=False):
        self.debug = debug

    def fit(self,data):
        """ Estimate fundamental matrix using eight
            selected correspondences. """

        # transpose and split data into the two point sets
        data = data.T
        x1 = data[:3,:8]
        x2 = data[3:,:8]

        # estimate fundamental matrix and return
        F = compute_fundamental_normalized(x1,x2)
        return F

    def get_error(self,data,F):
        """ Compute x^T F x for all correspondences,
            return error for each transformed point. """

        # transpose and split data into the two point sets
        data = data.T
        x1 = data[:3]
        x2 = data[3:]

        # Sampson distance as error measure
        Fx1 = dot(F,x1)
        Fx2 = dot(F,x2)
        denom = Fx1[0]**2 + Fx1[1]**2 + Fx2[0]**2 + Fx2[1]**2
        err = ( diag(dot(x1.T,dot(F,x2))) )**2 / denom

        # return error per point
        return err

As before, we need fit() and get_error() methods. The error measure chosen here is
the Sampson distance (see [13]). The fit() method now selects eight points and uses
a normalized version of the eight point algorithm:

def compute_fundamental_normalized(x1,x2):
    """ Computes the fundamental matrix from corresponding points
        (x1,x2 3*n arrays) using the normalized 8 point algorithm. """

    n = x1.shape[1]
    if x2.shape[1] != n:
        raise ValueError("Number of points don't match.")

    # normalize image coordinates
    x1 = x1 / x1[2]
    mean_1 = mean(x1[:2],axis=1)
    S1 = sqrt(2) / std(x1[:2])
    T1 = array([[S1,0,-S1*mean_1[0]],[0,S1,-S1*mean_1[1]],[0,0,1]])
    x1 = dot(T1,x1)

    x2 = x2 / x2[2]
    mean_2 = mean(x2[:2],axis=1)
    S2 = sqrt(2) / std(x2[:2])
    T2 = array([[S2,0,-S2*mean_2[0]],[0,S2,-S2*mean_2[1]],[0,0,1]])
    x2 = dot(T2,x2)

    # compute F with the normalized coordinates
    F = compute_fundamental(x1,x2)

    # reverse normalization
    F = dot(T1.T,dot(F,T2))

    return F/F[2,2]

This function normalizes the image points to zero mean and fixed variance. 
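
The effect of the normalizing transformation is easy to inspect in isolation. The standalone sketch below (Python 3 with NumPy) applies the same kind of transform to random points; the helper name normalize_points and the random test data are our own assumptions for illustration.

```python
import numpy as np

def normalize_points(x):
    """Translate points to zero mean and rescale, returning the points and T."""
    x = x / x[2]
    mean = np.mean(x[:2], axis=1)
    S = np.sqrt(2) / np.std(x[:2])
    T = np.array([[S, 0, -S * mean[0]],
                  [0, S, -S * mean[1]],
                  [0, 0, 1]])
    return T @ x, T

# hypothetical pixel coordinates of 20 points in homogeneous form
rng = np.random.RandomState(1)
x = np.vstack((rng.uniform(0, 1000, (2, 20)), np.ones(20)))

xn, T = normalize_points(x)
row_means = xn[:2].mean(axis=1)        # close to zero after normalization
x_restored = np.linalg.inv(T) @ xn     # the transform is invertible
```

Because the transform is an invertible 3 × 3 matrix, the fundamental matrix estimated in the normalized coordinates can be mapped back to the original coordinates afterward, which is what the "reverse normalization" step in the function does.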

Now we can use this class in a function. Add the following function to sfm.py. 

def F_from_ransac(x1,x2,model,maxiter=5000,match_theshold=1e-6):
    """ Robust estimation of a fundamental matrix F from point
        correspondences using RANSAC (ransac.py from
        http://www.scipy.org/Cookbook/RANSAC).

        input: x1,x2 (3*n arrays) points in hom. coordinates. """

    import ransac

    data = vstack((x1,x2))

    # compute F and return with inlier index
    F,ransac_data = ransac.ransac(data.T,model,8,maxiter,match_theshold,20,
                                    return_all=True)
    return F, ransac_data['inliers']

Here we return the best fundamental matrix F together with the inlier index so that we
know what matches were consistent with F. Compared to the homography estimation,
we increased the default max iterations and changed the matching threshold, which was
in pixels before and is in Sampson distance now.
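
To get a feeling for the scale of a Sampson-distance threshold, here is a standalone sketch of the first-order geometric error in its common textbook form. Note the assumptions: the function name sampson_distance is ours, and the code fixes the convention that points satisfy x2^T F x1 = 0, which is stated explicitly because conventions differ between implementations. The example F describes a purely horizontal camera shift.

```python
import numpy as np

def sampson_distance(F, x1, x2):
    """First-order geometric error, assuming the convention x2^T F x1 = 0.
    x1, x2 are 3*n arrays of homogeneous points with last row equal to 1."""
    Fx1 = F @ x1
    Ftx2 = F.T @ x2
    num = np.sum(x2 * Fx1, axis=0) ** 2
    denom = Fx1[0] ** 2 + Fx1[1] ** 2 + Ftx2[0] ** 2 + Ftx2[1] ** 2
    return num / denom

# hypothetical F for a sideways camera shift: matches must lie on the same row
F = np.array([[0., 0., 0.],
              [0., 0., -1.],
              [0., 1., 0.]])
x1 = np.array([[10., 10.],
               [20., 20.],
               [1., 1.]])
x2 = np.array([[5., 5.],
               [20., 22.],   # second match is two rows off
               [1., 1.]])
d = sampson_distance(F, x1, x2)
```

The first correspondence satisfies the epipolar constraint exactly and gets zero error, while the second, which is two pixels off its epipolar line, gets an error of 2 in squared pixel units. With calibration-normalized coordinates the same geometric offset corresponds to a far smaller number, which is consistent with using a very small threshold.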

Figure 5-7. Example image pair of a scene where the images are taken at different viewpoints.


3D Reconstruction Example 

In this section, we will see a complete example of reconstructing a 3D scene from start 
to finish. We will use two images taken with a camera with known calibration. The 
images are of the famous Alcatraz prison and are shown in Figure 5-7.^ 

Let's split up the code into a few chunks so that it is easier to follow. First, we extract
features, match them, and estimate a fundamental matrix and camera matrices:

import homography
import sfm
import sift

# calibration
K = array([[2394,0,932],[0,2398,628],[0,0,1]])

# load images and compute features
im1 = array(Image.open('alcatraz1.jpg'))
sift.process_image('alcatraz1.jpg','im1.sift')
l1,d1 = sift.read_features_from_file('im1.sift')

im2 = array(Image.open('alcatraz2.jpg'))
sift.process_image('alcatraz2.jpg','im2.sift')
l2,d2 = sift.read_features_from_file('im2.sift')

# match features
matches = sift.match_twosided(d1,d2)
ndx = matches.nonzero()[0]

# make homogeneous and normalize with inv(K)
x1 = homography.make_homog(l1[ndx,:2].T)
ndx2 = [int(matches[i]) for i in ndx]
x2 = homography.make_homog(l2[ndx2,:2].T)

x1n = dot(inv(K),x1)
x2n = dot(inv(K),x2)

# estimate E with RANSAC
model = sfm.RansacModel()
E,inliers = sfm.F_from_ransac(x1n,x2n,model)

# compute camera matrices (P2 will be list of four solutions)
P1 = array([[1,0,0,0],[0,1,0,0],[0,0,1,0]])
P2 = sfm.compute_P_from_essential(E)

^ Images courtesy of Carl Olsson (http://www.maths.lth.se/matematiklth/personal/calle/).

The calibration is known, so here we just hardcode the K matrix at the beginning.
As in earlier examples, we pick out the points that belong to matches. After that, we
normalize them with $K^{-1}$ and run the RANSAC estimation with the normalized eight
point algorithm. Since the points are normalized, this gives us an essential matrix. We
make sure to keep the index of the inliers, as we will need them. From the essential
matrix we compute the four possible solutions of the second camera matrix.

From the list of camera matrices, we pick the one that has the most scene points in 
front of both cameras after triangulation: 

# pick the solution with points in front of cameras
ind = 0
maxres = 0
for i in range(4):
    # triangulate inliers and compute depth for each camera
    X = sfm.triangulate(x1n[:,inliers],x2n[:,inliers],P1,P2[i])
    d1 = dot(P1,X)[2]
    d2 = dot(P2[i],X)[2]
    if sum(d1>0)+sum(d2>0) > maxres:
        maxres = sum(d1>0)+sum(d2>0)
        ind = i
        infront = (d1>0) & (d2>0)

# triangulate inliers and remove points not in front of both cameras
X = sfm.triangulate(x1n[:,inliers],x2n[:,inliers],P1,P2[ind])
X = X[:,infront]

We loop through the four solutions and each time triangulate the 3D points corresponding
to the inliers. The sign of the depth is given by the third value of each image
point after projecting the triangulated X back to the images. We keep the index with
the most positive depths and also store a boolean for each point in the best solution so
that we can pick only the ones that actually are in front. Due to noise and errors in all
of the estimations done, there is a risk that some points still are behind one camera,
even with the correct camera matrices. Once we have the right solution, we triangulate
the inliers and keep the points in front of the cameras.

Now we can plot the reconstruction: 

# 3D plot
from mpl_toolkits.mplot3d import axes3d

fig = figure()
ax = fig.gca(projection='3d')
ax.plot(-X[0],X[1],X[2],'k.')
axis('off')

The 3D plots with mplot3d have the first axis reversed compared to our coordinate
system, so we change the sign.

We can then plot the reprojection in each view: 

# plot the projection of X
import camera

# project 3D points
cam1 = camera.Camera(P1)
cam2 = camera.Camera(P2[ind])
x1p = cam1.project(X)
x2p = cam2.project(X)

# reverse K normalization
x1p = dot(K,x1p)
x2p = dot(K,x2p)

figure()
imshow(im1)
gray()
plot(x1p[0],x1p[1],'o')
plot(x1[0],x1[1],'r.')
axis('off')

figure()
imshow(im2)
gray()
plot(x2p[0],x2p[1],'o')
plot(x2[0],x2[1],'r.')
axis('off')
show()

After projecting the 3D points, we need to reverse the initial normalization by multi- 
plying with the calibration matrix. 

The result looks like Figure 5-8. As you can see, the reprojected points don’t exactly
match the original feature locations, but they are reasonably close. It is possible to
further refine the camera matrices to improve the reconstruction and reprojection, but
that is outside the scope of this simple example.


Extensions and More Than Two Views 

There are some steps and further extensions to multiple view reconstructions that we 
cannot cover in a book like this. Here are some of them with references for further 
reading. 

More views 

With more than two views of the same scene, the 3D reconstruction will usually be 
more accurate and more detailed. Since the fundamental matrix only relates a pair of 
views, the process is a little different with many images. 

Figure 5-8. Example of computing a 3D reconstruction from a pair of images using image matches:
the two images with feature points shown in black and reprojected reconstructed 3D points shown in
white (top); the 3D reconstruction (bottom).


For video sequences, we can use the temporal aspect and match features in consecutive 
frame pairs. The relative orientation needs to be added incrementally from each pair 
to the next (similar to how we added homographies in the panorama example in 
Figure 3-12). This approach usually works well, and tracking can be used to effectively 
find correspondences (see Section 10.4 for more on tracking). One problem is that 
errors will accumulate the more views that are added. This can be fixed with a final 
optimization step; see below. 

With still images, one approach is to find a central reference view and compute all
the other camera matrices relative to that one. Another method is to compute camera 
matrices and a 3D reconstruction for one image pair and then incrementally add new 
images and 3D points; see for example [34]. As a side note, there are ways to compute 
3D and camera positions from three views at the same time (see for example [13]), but 
beyond that an incremental approach is needed. 

Bundle adjustment

From our simple 3D reconstruction example in Figure 5-8, it is clear that there will
be errors in the position of the recovered points and in the camera matrices computed
from the estimated fundamental matrix. With more views, the errors will accumulate.
Therefore, a final step in multiple view reconstructions is often to try to minimize
the reprojection errors by optimizing the position of the 3D points and the camera
parameters. This process is called bundle adjustment. Details can be found in [13] and
[35] and a short overview at http://en.wikipedia.org/wiki/Bundle_adjustment.
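
The quantity that bundle adjustment tries to make small can be written down in a few lines. The standalone sketch below (Python 3 with NumPy) computes the reprojection error per point for a single camera; the function name reprojection_error and the toy data are illustrative assumptions, and the optimization itself is not shown.

```python
import numpy as np

def reprojection_error(P, X, x):
    """Distance in the image between observed points x (3*n homog.)
    and the projections of the 3D points X (4*n homog.) with camera P."""
    proj = P @ X
    proj = proj / proj[2]
    return np.sqrt(np.sum((proj[:2] - x[:2]) ** 2, axis=0))

# toy example: camera at the origin and two 3D points at depth 2
P = np.hstack((np.eye(3), np.zeros((3, 1))))
X = np.array([[0., 1.],
              [0., 1.],
              [2., 2.],
              [1., 1.]])

# observations: the first is displaced by (3, 4), the second is exact
x_obs = np.array([[3., 0.5],
                  [4., 0.5],
                  [1., 1.]])
err = reprojection_error(P, X, x_obs)
```

A bundle adjustment would sum the squares of such errors over all cameras and all points and adjust both the camera parameters and the 3D positions to reduce the total.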

Self-calibration

In the case of uncalibrated cameras, it is sometimes possible to compute the calibration 
from image features. This process is called self-calibration. There are many different 
algorithms, depending on what assumptions can be made on parameters of the camera 
calibration matrix and depending on what types of image data are available (feature
matches, parallel lines, planes, etc.). The interested reader can take a look at [13] and 
[26, Chap. 6]. 

As a side note to calibration, there is a useful script, extract_focal.pl, as part of the 
Bundler SfM system (http://phototour.cs.washington.edu/bundler/). This uses a lookup
table for common cameras and estimates the focal length based on the image EXIF data. 

5.4 Stereo Images 

A special case of multi-view imaging is stereo vision (or stereo imaging), where two 
cameras are observing the same scene with only a horizontal (sideways) displacement 
between the cameras. When the cameras are configured so that the two images have 
the same image plane with the image rows vertically aligned, the image pair is said to 
be rectified. This is common in robotics, and such a setup is often called a stereo rig. 

Any stereo camera setup can be rectified by warping the images to a common plane so 
that the epipolar lines are image rows (a stereo rig is usually constructed to give such 
rectified image pairs). This is outside the scope of this section, but the interested reader 
can find the details in [13, p. 303] or [3, p. 430]. 

Assuming that the two images are rectified, finding correspondences is constrained
to searching along image rows. Once a corresponding point is found, its depth (Z
coordinate) can be computed directly from the horizontal displacement as it is inversely
proportional to the displacement,

$$ Z = \frac{fb}{x_l - x_r}, $$

where $f$ is the rectified image focal length, $b$ the distance between the camera centers,
and $x_l$ and $x_r$ the x-coordinate of the corresponding point in the left and right image.
The distance separating the camera centers is called the baseline. Figure 5-9 illustrates
a rectified stereo camera setup.
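
As a small numeric illustration of the inverse relation between depth and displacement, Z = f b / (x_l - x_r), consider the standalone sketch below (Python 3 with NumPy). The focal length, baseline, and pixel positions are made-up example values, and the function name depth_from_disparity is our own.

```python
import numpy as np

def depth_from_disparity(f, b, x_l, x_r):
    """Depth Z = f*b/(x_l - x_r) for a rectified pair (disparity must be non-zero)."""
    disparity = np.asarray(x_l, dtype=float) - np.asarray(x_r, dtype=float)
    return f * b / disparity

f = 700.0   # hypothetical focal length in pixels
b = 0.1     # hypothetical baseline in meters
x_l = np.array([400.0, 400.0])   # x-coordinates in the left image
x_r = np.array([365.0, 382.5])   # matching x-coordinates in the right image
Z = depth_from_disparity(f, b, x_l, x_r)
```

The first point has a disparity of 35 pixels and the second half of that, so the second point comes out twice as far away. The depth is in the same unit as the baseline.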

Stereo reconstruction (sometimes called dense depth reconstruction) is the problem of
recovering a depth map (or, inversely, a disparity map) where the depth (or disparity)
for each pixel in the image is estimated. This is a classic problem in computer vision
and there are many algorithms for solving it. The Middlebury Stereo Vision Page
(http://vision.middlebury.edu/stereo/) contains a constantly updated evaluation of the
best algorithms with code and descriptions of many implementations. In the next
section, we will implement a stereo reconstruction algorithm based on normalized
cross-correlation.

Figure 5-9. An illustration of a rectified stereo image setup where corresponding points are on the same
row in both images.

Computing Disparity Maps 

In this stereo reconstruction algorithm, we will try a range of displacements and record 
the best displacement for each pixel by selecting the one with the best score according to 
normalized cross-correlation of the local image neighborhood. This is sometimes called 
plane sweeping, since each displacement step corresponds to a plane at some depth. 
While not exactly state of the art in stereo reconstruction, this is a simple method that 
usually gives decent results. 

Normalized cross-correlation can be efficiently computed when applied densely across 
images. This is different from when we applied it between sparse point correspondences 
in Chapter 2. We want to evaluate normalized cross-correlation on a patch (basically a 
local neighborhood) around each pixel. For this case, we can rewrite the NCC around 
a pixel, equation (2.3), as 

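$$
ncc(I_1, I_2) = \frac{\sum_{\mathbf{x}} (I_1(\mathbf{x}) - \mu_1)(I_2(\mathbf{x}) - \mu_2)}{\sqrt{\sum_{\mathbf{x}} (I_1(\mathbf{x}) - \mu_1)^2 \sum_{\mathbf{x}} (I_2(\mathbf{x}) - \mu_2)^2}},
$$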

where we skip the normalizing constant in front (it is not needed here) and the sums 
are taken over the pixels of a local patch around the pixel. 

Now, we want this for every pixel in the image. The three sums are over a local patch
region and can be computed efficiently using image filters, just as we did for blur
and derivatives. The function uniform_filter() in the ndimage.filters module will
compute the sums over a rectangular patch.
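
To see what the filter computes, here is a tiny standalone check (Python 3, using scipy.ndimage.uniform_filter on a made-up 5 × 5 test image). Strictly speaking the filter returns the patch mean, that is, the patch sum divided by the number of pixels in the patch. For the NCC this constant factor appears in both the numerator and the denominator and cancels.

```python
import numpy as np
from scipy.ndimage import uniform_filter

# small test image with values 0..24
im = np.arange(25, dtype=float).reshape(5, 5)
wid = 3

# mean over a wid x wid patch around every pixel
local_mean = uniform_filter(im, wid)

# compare with the explicit patch around the center pixel (2,2)
center_patch = im[1:4, 1:4]
center_sum = local_mean[2, 2] * wid**2   # patch sum recovered from the mean
```

The advantage of the filter is that it produces this value for all pixels in a single call instead of looping over patches.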

Here's the function that does the plane sweep and returns the best disparity for each
pixel. Create a file stereo.py and add the following:


def plane_sweep_ncc(im_l,im_r,start,steps,wid):
    """ Find disparity image using normalized cross-correlation. """

    m,n = im_l.shape

    # arrays to hold the different sums
    mean_l = zeros((m,n))
    mean_r = zeros((m,n))
    s = zeros((m,n))
    s_l = zeros((m,n))
    s_r = zeros((m,n))

    # array to hold depth planes
    dmaps = zeros((m,n,steps))

    # compute mean of patch
    filters.uniform_filter(im_l,wid,mean_l)
    filters.uniform_filter(im_r,wid,mean_r)

    # normalized images
    norm_l = im_l - mean_l
    norm_r = im_r - mean_r

    # try different disparities
    for displ in range(steps):
        # move left image to the right, compute sums
        filters.uniform_filter(roll(norm_l,-displ-start)*norm_r,wid,s) # sum numerator
        filters.uniform_filter(roll(norm_l,-displ-start)*roll(norm_l,-displ-start),wid,s_l)
        filters.uniform_filter(norm_r*norm_r,wid,s_r) # sum denominator

        # store ncc scores
        dmaps[:,:,displ] = s/sqrt(s_l*s_r)

    # pick best depth for each pixel
    return argmax(dmaps,axis=2)


First, we need to create some arrays to hold the filtering results as uniform_filter()
takes them as input arguments. Then, we create an array to hold each of the planes
so that we can apply argmax() along the last dimension to find the best depth for each
pixel. The function iterates over all steps displacements from start. One image is shifted
using the roll() function, and the three sums of the NCC are computed using filtering.
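
A detail worth knowing about roll(): when it is called without an axis argument, as in the function above, NumPy shifts the flattened array and then restores the shape. For a horizontal shift this means that values leaving a row on one side reappear in a neighboring row rather than in the same row. The small standalone sketch below (Python 3 with NumPy, on a made-up 2 × 3 array) shows the difference from an explicit per-row shift.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
# a is [[0, 1, 2],
#       [3, 4, 5]]

flat_shift = np.roll(a, -1)          # no axis: roll the flattened array
row_shift = np.roll(a, -1, axis=1)   # shift each row separately
```

The two results differ only in the last column, so for the plane sweep the choice affects just a band of pixels at the image border, where wrapped-around values are not meaningful in either case.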

Here is a full example of loading images and computing the displacement map using 
this function: 

import stereo

im_l = array(Image.open('scene1.row3.col3.ppm').convert('L'),'f')
im_r = array(Image.open('scene1.row3.col4.ppm').convert('L'),'f')

# starting displacement and steps
steps = 12
start = 4

# width for ncc
wid = 9

res = stereo.plane_sweep_ncc(im_l,im_r,start,steps,wid)

import scipy.misc
scipy.misc.imsave('depth.png',res)


Here we first load a pair of images from the classic “tsukuba” set and convert them
to grayscale. Next, we set the parameters needed for the plane sweep function: the
number of displacements to try, the starting value, and the width of the NCC patch.
You will notice that this method is fairly fast, at least compared to matching features
with NCC. This is because everything is computed using filters.

This approach also works for other filters. The uniform filter gives all pixels in a square 
patch equal weight, but in some cases other filters for the NCC computation might 
be preferred. Here is one alternative using a Gaussian filter that produces smoother 
disparity maps. Add this to stereo.py. 


def plane_sweep_gauss(im_l,im_r,start,steps,wid):
    """ Find disparity image using normalized cross-correlation
        with Gaussian weighted neighborhoods. """

    m,n = im_l.shape

    # arrays to hold the different sums
    mean_l = zeros((m,n))
    mean_r = zeros((m,n))
    s = zeros((m,n))
    s_l = zeros((m,n))
    s_r = zeros((m,n))

    # array to hold depth planes
    dmaps = zeros((m,n,steps))

    # compute mean
    filters.gaussian_filter(im_l,wid,0,mean_l)
    filters.gaussian_filter(im_r,wid,0,mean_r)

    # normalized images
    norm_l = im_l - mean_l
    norm_r = im_r - mean_r

    # try different disparities
    for displ in range(steps):
        # move left image to the right, compute sums
        filters.gaussian_filter(roll(norm_l,-displ-start)*norm_r,wid,0,s) # sum numerator
        filters.gaussian_filter(roll(norm_l,-displ-start)*roll(norm_l,-displ-start),wid,0,s_l)
        filters.gaussian_filter(norm_r*norm_r,wid,0,s_r) # sum denominator

        # store ncc scores
        dmaps[:,:,displ] = s/sqrt(s_l*s_r)

    # pick best depth for each pixel
    return argmax(dmaps,axis=2)

The code is the same as for the uniform filter with the exception of the extra argument
in the filtering. We need to pass a zero to gaussian_filter() to indicate that we want a
standard Gaussian and not any derivatives (see page 18 for details).

Use this function the same way as the previous plane sweep function. Figures 5-10
and 5-11 show some results of these two plane sweep implementations on some standard
stereo benchmark images. The images are from [29] and [30] and are available at
http://vision.middlebury.edu/stereo/data/. Here we used the “tsukuba” and “cones” images
and set wid to 9 in the standard version and 3 for the Gaussian version. The top
row shows the image pair, bottom left is the standard NCC plane sweep, and bottom
right is the Gaussian version. As you can see, the Gaussian version is less noisy but also
has less detail than the standard version.

Figure 5-10. Example of computing disparity maps from a stereo image pair with normalized cross-correlation.

Figure 5-11. Example of computing disparity maps from a stereo image pair with normalized cross-correlation.

Exercises 

1. Use the techniques introduced in this chapter to verify matches in the White House 
example on page 46 (or even better, an example of your own) and see if you can 
improve on the results. 

2. Compute feature matches for an image pair and estimate the fundamental matrix. 
Use the epipolar lines to do a second pass to find more matches by searching for 
the best match along the epipolar line for each feature. 

3. Take a set with three or more images. Pick one pair and compute 3D points and 
camera matrices. Match features to the remaining images to get correspondences. 
Then take the 3D points for the correspondences and compute camera matrices 
for the other images using resection. Plot the 3D points and the camera positions. 
Use a set of your own or one of the Oxford multi-view sets. 

4. Implement a stereo version that uses sum of squared differences (SSD) instead of
NCC using filtering the same way as in the NCC example.

5. Try smoothing the stereo depth maps using the ROF de-noising from Section 1.5.
Experiment with the size of the cross-correlation patches to get sharp edges with 
noise levels that can be removed with smoothing. 

6. One way to improve the quality of the disparity maps is to compare the disparities 
from moving the left image to the right and the right image to the left, and only keep 
the parts that are consistent. This will, for example, clean up the parts where there 
is occlusion. Implement this idea and compare the results to the one-directional 
plane sweeping. 

7. The New York Public Library has many old historic stereo photographs. Browse 
the gallery at http://stereo.nypl.org/gallery and download some images you like (you 
can right click and save JPEGs). The images should be rectified already. Cut the
image in two parts and try the dense depth reconstruction code.

CHAPTER 6


Clustering Images 


This chapter introduces several clustering methods and shows how to use them for 
clustering images for finding groups of similar images. Clustering can be used for 
recognition, for dividing data sets of images, and for organization and navigation. We 
also look at using clustering for visualizing similarity between images. 

6.1 K-Means Clustering

K-means is a very simple clustering algorithm that tries to partition the input data into
k clusters. K-means works by iteratively refining an initial estimate of class centroids as
follows:

1. Initialize centroids $\mu_i$, $i = 1 \ldots k$, randomly or with some guess.

2. Assign each data point to the class $c_i$ of its nearest centroid.

3. Update the centroids as the average of all data points assigned to that class.

4. Repeat 2 and 3 until convergence.

K-means tries to minimize the total within-class variance

$$V = \sum_{i=1}^{k} \sum_{x_j \in c_i} (x_j - \mu_i)^2,$$

where $x_j$ are the data vectors. The algorithm above is a heuristic refinement algorithm
that works fine for most cases, but it does not guarantee that the best solution is found. 
To avoid the effects of choosing a bad centroid initialization, the algorithm is often 
run several times with different initialization centroids. Then the solution with lowest 
variance V is selected. 
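The four steps and the restart strategy can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names kmeans_once and kmeans_restarts, the choice of initializing with randomly picked data points, and the stopping test are all hypothetical and are not the SciPy implementation used later in this chapter.

```python
import numpy as np

def kmeans_once(X, k, rng, n_iter=100):
    """One run of the four steps; returns (centroids, labels, V)."""
    # 1. initialize the centroids with k distinct data points (an assumption)
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # 2. assign each point to the class of its nearest centroid
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # 3. update each centroid as the mean of its points (keep it if the class is empty)
        new_mu = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else mu[i]
                           for i in range(k)])
        # 4. repeat until the centroids stop moving
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    # final assignments and total within-class variance V
    d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    V = d[np.arange(len(X)), labels].sum()
    return mu, labels, V

def kmeans_restarts(X, k, n_runs=10, seed=0):
    """Run k-means several times and keep the run with the lowest V."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, k, rng) for _ in range(n_runs)]
    return min(runs, key=lambda r: r[2])

# two synthetic 2D classes, similar to the sample data used below
rng = np.random.default_rng(1)
X = np.vstack((1.5 * rng.standard_normal((100, 2)),
               rng.standard_normal((100, 2)) + np.array([5, 5])))
mu, labels, V = kmeans_restarts(X, 2)
print(mu)
```

Each restart can end in a different local minimum of V, which is why only the best run is kept.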

The main drawback of this algorithm is that the number of clusters needs to be decided 
beforehand, and an inappropriate choice will give poor clustering results. The benefits 
are that it is simple to implement, it is parallelizable, and it works well for a large range 
of problems without any need for tuning.

The SciPy Clustering Package

Although simple to implement, there is no need to. The SciPy vector quantization
package scipy.cluster.vq comes with a k-means implementation. Here's how to use it.

Let's start with creating some sample 2D data to illustrate:

from scipy.cluster.vq import * 

class1 = 1.5 * randn(100,2)
class2 = randn(100,2) + array([5,5])
features = vstack((class1,class2))

This generates two normally distributed classes in two dimensions. To try to cluster
the points, run k-means with k = 2 like this:

centroids,variance = kmeans(features,2)

The variance is returned but we don’t really need it, since the SciPy implementation 
computes several runs (default is 20) and selects the one with smallest variance for us. 
Now you can check where each data point is assigned using the vector quantization 
function in the SciPy package: 

code,distance = vq(features,centroids) 

By checking the value of code, we can see if there are any incorrect assignments. To 
visualize, we can plot the points and the final centroids: 

figure()
ndx = where(code==0)[0]
plot(features[ndx,0],features[ndx,1],'*')
ndx = where(code==1)[0]
plot(features[ndx,0],features[ndx,1],'r.')
plot(centroids[:,0],centroids[:,1],'go')
axis('off')
show()

Here the function where() gives the indices for each class. This should give a plot like
the one in Figure 6-1.
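Checking code for incorrect assignments can also be done numerically. The sketch below regenerates similar sample data with an explicit random generator (the seed, the truth array, and the variable errors are illustrative assumptions) and counts mismatches. Because the numbering of the k-means clusters is arbitrary, both possible labelings are tried and the better one is kept.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

rng = np.random.default_rng(0)
class1 = 1.5 * rng.standard_normal((100, 2))
class2 = rng.standard_normal((100, 2)) + np.array([5, 5])
features = np.vstack((class1, class2))
truth = np.repeat([0, 1], 100)   # which class each row was drawn from

centroids, variance = kmeans(features, 2)
code, distance = vq(features, centroids)

# cluster labels are arbitrary, so compare against both labelings
errors = min(np.sum(code != truth), np.sum(code != 1 - truth))
print('misassigned points:', errors, 'of', len(truth))
```

With classes this well separated, only the few points that fall on the wrong side of the midline between the two centroids should be counted.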

Clustering Images 

Let's try k-means on the font images described on page 14. The file selectedfontimages.zip
contains 66 images from this font data set (these are selected for easy overview when 
illustrating the clusters). As descriptor vector for each image, we will use the projection 
coefficients after projecting on the 40 first principal components computed earlier. 
Loading the model file using pickle, projecting the images on the principal components, 
and clustering is then done like this: 

import imtools
import pickle
from scipy.cluster.vq import *

# get list of images
imlist = imtools.get_imlist('selected_fontimages/')
imnbr = len(imlist)

# load model file
with open('a_pca_modes.pkl','rb') as f:
    immean = pickle.load(f)
    V = pickle.load(f)

# create matrix to store all flattened images
immatrix = array([array(Image.open(im)).flatten()
                  for im in imlist],'f')

# project on the 40 first PCs
immean = immean.flatten()
projected = array([dot(V[:40],immatrix[i]-immean) for i in range(imnbr)])

# k-means
projected = whiten(projected)
centroids,distortion = kmeans(projected,4)

code,distance = vq(projected,centroids)

Same as before, code contains the cluster assignment for each image. In this case, we 
tried k = 4. We also chose to “whiten” the data using SciPy's whiten(), normalizing so
that each feature has unit variance. Try to vary parameters like the number of principal
components used and the value of k to see how the clustering results change. The
clusters can be visualized like this:

# plot clusters
for k in range(4):
    ind = where(code==k)[0]
    figure()
    gray()
    for i in range(minimum(len(ind),40)):
        subplot(4,10,i+1)
        imshow(immatrix[ind[i]].reshape((25,25)))
        axis('off')
show()


Figure 6-1. An example of k-means clustering of 2D points. Class centroids are marked as large rings
and the predicted classes are stars and dots, respectively.

Figure 6-2. An example of k-means clustering with k = 4 of the font images using 40 principal
components.


Here we show each cluster in a separate figure window in a grid with maximum 40 
images from the cluster shown. We use the PyLab function subplot() to define the grid. 
A sample cluster result can look like the one in Figure 6-2.

For more details on the k-means SciPy implementation and the scipy.cluster.vq package,
see the reference guide http://docs.scipy.org/doc/scipy/reference/cluster.vq.html.
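The whitening step in the font example is easy to reproduce by hand, which may help when deciding whether it is appropriate for a given descriptor. The sketch below uses synthetic data with two features on very different scales (the data and the variable names are illustrative assumptions) and compares whiten() with a plain division by the per-column standard deviation. Note that the mean is not subtracted.

```python
import numpy as np
from scipy.cluster.vq import whiten

rng = np.random.default_rng(0)
# two features on very different scales
data = np.column_stack((rng.normal(0, 100, 500), rng.normal(0, 0.1, 500)))

white = whiten(data)

# the same thing by hand: divide each column by its standard deviation
by_hand = data / data.std(axis=0)

print(white.std(axis=0))   # each feature now has unit variance
```

Without this rescaling, the first feature would dominate every Euclidean distance computed by k-means.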

Visualizing the Images on Principal Components 

To see how the clustering using just a few principal components as above can work, 
we can visualize the images on their coordinates in a pair of principal component 
directions. One way is to project on two components by changing the projection to 

projected = array([dot(V[[0,2]],immatrix[i]-immean) for i in range(imnbr)])

to get only the relevant coordinates (in this case V[[0,2]] gives the first and third).
Alternatively, project on all components and afterward just pick out the columns you
need.
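The two alternatives give the same coordinates. The sketch below checks this on random stand-in arrays; the shapes mimic 40 principal components of 25 x 25 images and 66 flattened images, but the values are synthetic assumptions rather than the font data.

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.standard_normal((40, 625))         # stand-in for 40 principal components (as rows)
immatrix = rng.standard_normal((66, 625))  # stand-in for 66 flattened 25x25 images
immean = immatrix.mean(axis=0)

# option 1: project directly on the first and third components
direct = np.dot(immatrix - immean, V[[0, 2]].T)

# option 2: project on all components, then pick out columns 0 and 2
projected_all = np.dot(immatrix - immean, V.T)
picked = projected_all[:, [0, 2]]

print(np.allclose(direct, picked))
```

Projecting on all components once is convenient when you want to look at several different pairs.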

For the visualization, we will use the ImageDraw module in PIL. Assuming that you have 
the projected images and image list as above, the following short script will generate a 
plot like the one in Figure 6-3: 

from PIL import Image, ImageDraw

# height and width
h,w = 1200,1200

# create a new image with a white background
img = Image.new('RGB',(w,h),(255,255,255))
draw = ImageDraw.Draw(img)

# draw axis
draw.line((0,h/2,w,h/2),fill=(255,0,0))
draw.line((w/2,0,w/2,h),fill=(255,0,0))

# scale coordinates to fit (converted to integer pixel coordinates)
scale = abs(projected).max(0)
scaled = floor(array([ (p / scale) * (w/2-20,h/2-20) +
                       (w/2,h/2) for p in projected])).astype(int)

# paste thumbnail of each image
for i in range(imnbr):
    nodeim = Image.open(imlist[i])
    nodeim.thumbnail((25,25))
    ns = nodeim.size
    img.paste(nodeim,(scaled[i][0]-ns[0]//2,scaled[i][1]-
              ns[1]//2,scaled[i][0]+ns[0]//2+1,scaled[i][1]+ns[1]//2+1))

img.save('pca_font.jpg')

Figure 6-3. The projection of the font images on pairs of principal components: the first and second
principal components (left); the second and third (right).

Here we used the integer or floor division operator //, which returns an integer pixel 
position by removing any values after the decimal point. 

Plots like these illustrate how the images are distributed in the 40 dimensions and can 
be very useful for choosing a good descriptor. Already in just these two-dimensional 
projections the closeness of similar font images is clearly visible. 


Clustering Pixels

Before closing this section, we will take a look at an example of clustering individual
pixels instead of entire images. Grouping image regions and pixels into “meaningful”
components is called image segmentation and will be the topic of Chapter 9. Naively
applying k-means on the pixel values will not give anything meaningful except in very
simple images. More sophisticated class models than the average pixel color, or some
form of spatial consistency, are needed to produce useful results. For now, let's just
apply k-means to the RGB values and worry about solving segmentation problems later
(Section 9.2 has the details).

The following code sample takes an image, reduces it to a lower resolution version with
pixels as mean values of the original image regions (taken over a square grid of size
steps × steps), and clusters the regions using k-means:


from scipy.cluster.vq import *
from scipy.misc import imresize # imresize was removed in SciPy 1.3; this import needs an older SciPy

steps = 50 # image is divided in steps*steps regions
im = array(Image.open('empire.jpg'))

dx = im.shape[0] // steps
dy = im.shape[1] // steps

# compute color features for each region
features = []
for x in range(steps):
    for y in range(steps):
        R = mean(im[x*dx:(x+1)*dx,y*dy:(y+1)*dy,0])
        G = mean(im[x*dx:(x+1)*dx,y*dy:(y+1)*dy,1])
        B = mean(im[x*dx:(x+1)*dx,y*dy:(y+1)*dy,2])
        features.append([R,G,B])
features = array(features,'f') # make into array

# cluster
centroids,variance = kmeans(features,3)
code,distance = vq(features,centroids)

# create image with cluster labels
codeim = code.reshape(steps,steps)
codeim = imresize(codeim,im.shape[:2],interp='nearest')

figure()
imshow(codeim)
show()


The input to k-means is an array with steps*steps rows, each containing the R, G, and
B mean values. To visualize the result, we use SciPy's imresize() function to show the
steps*steps image at the original image coordinates. The parameter interp specifies what
type of interpolation to use; here we use nearest neighbor so we don't introduce new
pixel values at the transitions between classes.
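Since scipy.misc.imresize is not available in recent SciPy releases (it was removed in SciPy 1.3), the nearest neighbor step can be sketched with plain NumPy indexing instead. The helper name resize_nearest and the tiny label array are hypothetical, and this is only one possible substitute, not the function the text above uses.

```python
import numpy as np

def resize_nearest(labels, shape):
    """Nearest neighbor resize of a 2D label array to the given (rows, cols)."""
    # for every output row/column, the index of the source row/column it falls in
    rows = (np.arange(shape[0]) * labels.shape[0]) // shape[0]
    cols = (np.arange(shape[1]) * labels.shape[1]) // shape[1]
    return labels[rows[:, None], cols[None, :]]

codeim = np.array([[0, 1],
                   [2, 0]])
big = resize_nearest(codeim, (4, 6))
print(big)
```

Because every output pixel copies an existing label, no new values appear at the transitions between classes, which is the property the nearest neighbor setting is chosen for.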

Figure 6-4 shows results using 50 × 50 and 100 × 100 regions for two relatively simple
example images. Note that the ordering of the k-means labels (in this case the colors in
the result images) is arbitrary. As you can see, the result is noisy despite down-sampling
to only use a few regions. There is no spatial consistency and it is hard to separate
regions, like the boy and the grass in the lower example. Spatial consistency and better
separation will be dealt with later, together with other image segmentation algorithms.
Now let's move on to the next basic clustering algorithm.

Figure 6-4. Clustering of pixels based on their color value using k-means: original image (left); cluster
result with k = 3 and 50 × 50 resolution (middle); cluster result with k = 3 and 100 × 100 resolution
(right).


6.2 Hierarchical Clustering 

Hierarchical clustering (or agglomerative clustering) is another simple but powerful 
clustering algorithm. The idea is to build a similarity tree based on pairwise distances. 
The algorithm starts with grouping the two closest objects (based on the distance 
between feature vectors) and creates an “average” node in a tree with the two objects 
as children. Then the next closest pair is found among the remaining objects but then 
also including any average nodes, and so on. At each node, the distance between the 
two children is also stored. Clusters can then be extracted by traversing this tree and 
stopping at nodes with distance smaller than some threshold that then determines the 
cluster size. 

Hierarchical clustering has several benefits. For example, the tree structure can be used 
to visualize relationships and show how clusters are related. A good feature vector will 
give a nice separation in the tree. Another benefit is that the tree can be reused with
different cluster thresholds without having to recompute the tree. The drawback is that
one needs to choose a threshold if the actual clusters are needed.

Let's see what this looks like in code. (There is also a version of hierarchical clustering
in the SciPy clustering package that you can look at if you like. We will not use that
version here since we want a class that can draw dendrograms and visualize clusters
using image thumbnails.) Create a file hcluster.py and add the following code (inspired
by the hierarchical clustering example in [31]):

from itertools import combinations
from numpy import array, sqrt # needed by the functions in this file

class ClusterNode(object):
    def __init__(self,vec,left,right,distance=0.0,count=1):
        self.left = left
        self.right = right
        self.vec = vec
        self.distance = distance
        self.count = count # only used for weighted average

    def extract_clusters(self,dist):
        """ Extract list of sub-tree clusters from
            hcluster tree with distance<dist. """
        if self.distance < dist:
            return [self]
        return self.left.extract_clusters(dist) + self.right.extract_clusters(dist)

    def get_cluster_elements(self):
        """ Return ids for elements in a cluster sub-tree. """
        return self.left.get_cluster_elements() + self.right.get_cluster_elements()

    def get_height(self):
        """ Return the height of a node,
            height is sum of each branch. """
        return self.left.get_height() + self.right.get_height()

    def get_depth(self):
        """ Return the depth of a node, depth is
            max of each child plus own distance. """
        return max(self.left.get_depth(), self.right.get_depth()) + self.distance


class ClusterLeafNode(object):
    def __init__(self,vec,id):
        self.vec = vec
        self.id = id

    def extract_clusters(self,dist):
        return [self]

    def get_cluster_elements(self):
        return [self.id]

    def get_height(self):
        return 1

    def get_depth(self):
        return 0


def L2dist(v1,v2):
    return sqrt(sum((v1-v2)**2))


def L1dist(v1,v2):
    return sum(abs(v1-v2))


def hcluster(features,distfcn=L2dist):
    """ Cluster the rows of features using
        hierarchical clustering. """

    # cache of distance calculations
    distances = {}

    # initialize with each row as a cluster
    node = [ClusterLeafNode(array(f),id=i) for i,f in enumerate(features)]

    while len(node)>1:
        closest = float('Inf')

        # loop through every pair looking for the smallest distance
        for ni,nj in combinations(node,2):
            if (ni,nj) not in distances:
                distances[ni,nj] = distfcn(ni.vec,nj.vec)

            d = distances[ni,nj]
            if d<closest:
                closest = d
                lowestpair = (ni,nj)
        ni,nj = lowestpair

        # average the two clusters
        new_vec = (ni.vec + nj.vec) / 2.0

        # create new node
        new_node = ClusterNode(new_vec,left=ni,right=nj,distance=closest)
        node.remove(ni)
        node.remove(nj)
        node.append(new_node)

    return node[0]

We created two classes for tree nodes, ClusterNode and ClusterLeafNode, to be used to 
create the cluster tree. The function hcluster() builds the tree. First, a list of leaf nodes 
is created, then the closest pairs are iteratively grouped together based on the distance 
measure chosen. Returning the final node will give you the root of the tree. Running 
hcluster() on a matrix with feature vectors as rows will create and return the cluster tree.

The choice of distance measure depends on the actual feature vectors. Here, we used
the Euclidean (L2) distance (a function for L1 distance is also provided), but you can
create any function and use that as parameter to hcluster(). We also used the average
feature vector of all nodes in a sub-tree as a new feature vector to represent the sub-tree
and treat each sub-tree as objects. There are other choices for deciding which two nodes
to merge next, such as using single linking (use the minimum distance between objects
in two sub-trees) and complete linking (use the maximum distance between objects in
two sub-trees). The choice of linking will affect the type of clusters produced.
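To see how the linking choices differ, the sketch below computes the distance between two small groups of points in three ways. The function names and the sample points are illustrative assumptions and are not part of hcluster.py. For sub-trees with two elements each, the mean-vector distance matches the averaged node vectors that hcluster() compares.

```python
import numpy as np

def pairwise_dists(A, B):
    """Euclidean distances between all rows of A and all rows of B."""
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))

def single_link(A, B):
    # minimum distance between objects in the two groups
    return pairwise_dists(A, B).min()

def complete_link(A, B):
    # maximum distance between objects in the two groups
    return pairwise_dists(A, B).max()

def mean_vec_dist(A, B):
    # distance between the mean vectors of the two groups
    return np.sqrt(((A.mean(axis=0) - B.mean(axis=0)) ** 2).sum())

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [6.0, 0.0]])
print(single_link(A, B), complete_link(A, B), mean_vec_dist(A, B))
```

Single linking tends to chain elongated groups together, while complete linking favors compact groups, so the merge order (and the resulting tree) can change with the choice.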

To extract the clusters from the tree, you need to traverse the tree from the top until a 
node with distance value smaller than some threshold is found. This is easiest done 
recursively. The ClusterNode method extract_clusters() handles this by returning a 
list with the node itself if below the distance threshold, and otherwise calling the child 
nodes (leaf nodes always return themselves). Calling this function will return a list 
of sub-trees containing the clusters. To get the leaf nodes for each cluster sub-tree that 
contains the object ids, traverse each sub-tree and return a list of leaves using the method
get_cluster_elements(). 

Let's try this on a simple example to see it all in action. First create some 2D data points
(same as for k-means above):

class1 = 1.5 * randn(100,2)
class2 = randn(100,2) + array([5,5])
features = vstack((class1,class2))

Cluster the points and extract the clusters from the list using some threshold (here we 
used 5) and print the clusters in the console: 

import hcluster

tree = hcluster.hcluster(features)

clusters = tree.extract_clusters(5)

print('number of clusters', len(clusters))
for c in clusters:
    print(c.get_cluster_elements())

This should give a printout similar to this: 

number of clusters 2
[184, 187, 196, 137, 174, 102, 147, 145, 185, 109, 166, 152, 173, 180, 128, 163, 141,
178, 151, 158, 108, 182, 112, 199, 100, 119, 132, 195, 105, 159, 140, 171, 191, 164,
130, 149, 150, 157, 176, 135, 123, 131, 118, 170, 143, 125, 127, 139, 179, 126, 160,
162, 114, 122, 103, 146, 115, 120, 142, 111, 154, 116, 129, 136, 144, 167, 106, 107,
198, 186, 153, 156, 134, 101, 110, 133, 189, 168, 183, 148, 165, 172, 188, 138, 192,
104, 124, 113, 194, 190, 161, 175, 121, 197, 177, 193, 169, 117, 155]
[56, 4, 47, 18, 51, 95, 29, 91, 23, 80, 83, 3, 54, 68, 69, 5, 21, 1, 44, 57, 17, 90,
30, 22, 63, 41, 7, 14, 59, 96, 20, 26, 71, 88, 86, 40, 27, 38, 50, 55, 67, 8, 28, 79,
64, 66, 94, 33, 53, 70, 31, 81, 9, 75, 15, 32, 89, 6, 11, 48, 58, 2, 39, 61, 45,
65, 82, 93, 97, 52, 62, 16, 43, 84, 24, 19, 74, 36, 37, 60, 87, 92, 181, 99, 10, 49,
12, 76, 98, 46, 72, 34, 35, 13, 73, 78, 25, 42, 77, 85]

Ideally you should get two clusters, but depending on the actual data you might get
three or even more. In this simple example of clustering 2D points, one cluster should
contain values lower than 100 and the other values 100 and above.
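For comparison, a similar experiment can be run with SciPy's own hierarchical clustering routines. This is a sketch under assumptions: the 'median' linkage method is chosen here because it also represents a merged cluster by the average of its two children's vectors, which appears to be the closest match to hcluster(), but the two implementations are not guaranteed to produce identical trees or the same number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
features = np.vstack((1.5 * rng.standard_normal((100, 2)),
                      rng.standard_normal((100, 2)) + np.array([5, 5])))

# build the cluster tree; each row of Z describes one merge
Z = linkage(features, method='median')

# cut the tree at distance threshold 5 and get a cluster label per data point
labels = fcluster(Z, t=5, criterion='distance')
print('number of clusters', labels.max())
```

Here fcluster() returns one label per row of features, rather than a list of sub-trees as extract_clusters() does.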

Clustering Images

Let's look at an example of clustering images based on their color content. The file
sunsets.zip contains 100 images downloaded from Flickr using the tag “sunset” or 
“sunsets.” For this example, we will use a color histogram of each image as feature 
vector. This is a bit crude and simple but good enough for illustrating what hierarchical 
clustering does. Try running the following code in a folder containing the sunset images: 

import os
import hcluster

# create a list of images
path = 'flickr-sunsets/'
imlist = [os.path.join(path,f) for f in os.listdir(path) if f.endswith('.jpg')]

# extract feature vector (8 bins per color channel)
features = zeros([len(imlist), 512])
for i,f in enumerate(imlist):
    im = array(Image.open(f))

    # multi-dimensional histogram
    # (in newer NumPy versions the normed=True option is spelled density=True)
    h,edges = histogramdd(im.reshape(-1,3),8,normed=True,
                          range=[(0,255),(0,255),(0,255)])
    features[i] = h.flatten()

tree = hcluster.hcluster(features)

Here we take the R, G, and B color channels as vectors and feed them into NumPy's
histogramdd(), which computes multi-dimensional histograms (in this case three
dimensions). We chose 8 bins in each color dimension (8 × 8 × 8), which, after flattening,
gives 512 bins in the feature vector. We use the “normed=True” option to normalize the
histograms in case the images are of different size and set the range to 0 . . . 255 for
each color channel. The use of reshape() with one dimension set to -1 will automatically
determine the correct size, and thereby create an input array to the histogram
computation consisting of the RGB color values as rows.
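The shapes involved are easy to verify on a synthetic image. In the sketch below, the random image is a stand-in assumption for a real photo, and the normalization option is written as density=True, which is the spelling current NumPy releases accept for histogramdd().

```python
import numpy as np

rng = np.random.default_rng(0)
im = rng.integers(0, 256, size=(30, 40, 3))   # stand-in for an RGB image

# one row per pixel, one column per color channel
pixels = im.reshape(-1, 3)
print(pixels.shape)

# 8 bins per channel gives an 8 x 8 x 8 histogram
h, edges = np.histogramdd(pixels, bins=8, density=True,
                          range=[(0, 255), (0, 255), (0, 255)])
feature = h.flatten()
print(h.shape, feature.shape)
```

With normalization, the histogram values times the bin volume sum to one, so images of different sizes give comparable feature vectors.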

To visualize the cluster tree, we can draw a dendrogram. A dendrogram is a diagram 
that shows the tree layout. This often gives useful information on how good a given 
descriptor vector is and what is considered similar in a particular case. Add the following 
code to hcluster.py:

from PIL import Image,ImageDraw

def draw_dendrogram(node,imlist,filename='clusters.jpg'):
    """ Draw a cluster dendrogram and save to a file. """

    # height and width
    rows = node.get_height()*20
    cols = 1200

    # scale factor for distances to fit image width
    s = float(cols-150)/node.get_depth()

    # create image and draw object
    im = Image.new('RGB',(cols,rows),(255,255,255))
    draw = ImageDraw.Draw(im)

    # initial line for start of tree
    draw.line((0,rows/2,20,rows/2),fill=(0,0,0))

    # draw the nodes recursively
    node.draw(draw,20,(rows/2),s,imlist,im)
    im.save(filename)
    im.show()

Here the dendrogram drawing uses a draw() method for each node. Add this method 
to the ClusterNode class: 

    def draw(self,draw,x,y,s,imlist,im):
        """ Draw nodes recursively with image
            thumbnails for leaf nodes. """

        h1 = int(self.left.get_height()*20 / 2)
        h2 = int(self.right.get_height()*20 / 2)
        top = y-(h1+h2)
        bottom = y+(h1+h2)

        # vertical line to children
        draw.line((x,top+h1,x,bottom-h2),fill=(0,0,0))

        # horizontal lines
        ll = self.distance*s
        draw.line((x,top+h1,x+ll,top+h1),fill=(0,0,0))
        draw.line((x,bottom-h2,x+ll,bottom-h2),fill=(0,0,0))

        # draw left and right child nodes recursively
        self.left.draw(draw,x+ll,top+h1,s,imlist,im)
        self.right.draw(draw,x+ll,bottom-h2,s,imlist,im)

The leaf nodes have their own special method to draw thumbnails of the actual images. 
Add this to the ClusterLeafNode class: 

    def draw(self,draw,x,y,s,imlist,im):
        nodeim = Image.open(imlist[self.id])
        nodeim.thumbnail([20,20])
        ns = nodeim.size
        im.paste(nodeim,[int(x),int(y-ns[1]//2),int(x+ns[0]),int(y+ns[1]-ns[1]//2)])

The height of a dendrogram and the sub parts are determined by the distance values. 
These need to be scaled to fit inside the chosen image resolution. The nodes are drawn 
recursively with the coordinates passed down to the level below. Leaf nodes are drawn 
with small thumbnail images of 20 x 20 pixels. Two helper methods are used to get the 
height and width of the tree, get_height() and get_depth(). 

The dendrogram is drawn like this: 

hcluster.draw_dendrogram(tree,imlist,filename='sunset.pdf') 

Figure 6-6. Example clusters from the 100 images of sunsets obtained with hierarchical clustering
using a threshold set to 23% of the maximum node distance in the tree.

The cluster dendrogram for the sunset images is shown in Figure 6-5. As can be seen,
images with similar color are close in the tree. Three example clusters are shown in
Figure 6-6. The clusters in this example are extracted as follows:

# visualize clusters with some (arbitrary) threshold
clusters = tree.extract_clusters(0.23*tree.distance)

# plot images for clusters with more than 3 elements
for c in clusters:
    elements = c.get_cluster_elements()
    nbr_elements = len(elements)
    if nbr_elements>3:
        figure()
        for p in range(minimum(nbr_elements,20)):
            subplot(4,5,p+1)
            im = array(Image.open(imlist[elements[p]]))
            imshow(im)
            axis('off')
show()

As a final example, we can create a dendrogram for the font images:

tree = hcluster.hcluster(projected)
hcluster.draw_dendrogram(tree,imlist,filename='fonts.jpg')

where projected and imlist refer to the variables used in the k-means example in
Section 6.1. The resulting font images dendrogram is shown in Figure 6-7.

6.3 Spectral Clustering 

Spectral clustering methods are an interesting type of clustering algorithm that have a
different approach compared to k-means and hierarchical clustering.

A similarity matrix (or affinity matrix, or sometimes distance matrix) for n elements (for
example images) is an n × n matrix with pair-wise similarity scores. Spectral clustering
gets its name from the use of the spectrum of a matrix constructed from a similarity
matrix. The eigenvectors of this matrix are used for dimensionality reduction and then
clustering.

Figure 6-7. An example of hierarchical clustering of 66 selected font images using 40 principal
components as feature vector.

One of the benefits of spectral clustering methods is that the only input needed is this
matrix and it can be constructed from any measure of similarity you can think of.
Methods like k-means and hierarchical clustering compute the mean of feature vectors,
and this restricts the features (or descriptors) to vectors (in order to be able to compute
the mean). With spectral methods, there is no need to have feature vectors of any kind,
just a notion of “distance” or “similarity.”

Here's how it works. Given an n × n similarity matrix S with similarity scores $s_{ij}$, we can
create a matrix, called the Laplacian matrix,*

$$L = I - D^{-1/2} S D^{-1/2},$$

where I is the identity matrix and D is the diagonal matrix containing the row sums
of S, $D = \mathrm{diag}(d_i)$, $d_i = \sum_j s_{ij}$. The matrix $D^{-1/2}$ used in the construction of the
Laplacian matrix is then

$$D^{-1/2} = \mathrm{diag}\left(\frac{1}{\sqrt{d_1}}, \frac{1}{\sqrt{d_2}}, \ldots, \frac{1}{\sqrt{d_n}}\right).$$

In order to make the presentation clearer, let's use low values of $s_{ij}$ for similar elements
and require $s_{ij} \geq 0$ (the term distance matrix is perhaps more fitting in this case).

The clusters are found by computing the eigenvectors of L and using the k eigenvectors
corresponding to the k largest eigenvalues to construct a set of feature vectors (remember
that we may not have had any to start with!). Create a matrix with the k eigenvectors
as columns. The rows will then be treated as new feature vectors (of length k). These
new feature vectors can then be clustered, using, for example, k-means to produce the
final clusters. In essence, what the algorithm does is to transform the original data into
new feature vectors that can be more easily clustered (and in some cases using cluster
algorithms that could not be used in the first place).
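The construction of the new feature vectors can be sketched as follows. This is an illustration under assumptions: the function name spectral_features and the six sample points are hypothetical, and numpy.linalg.eigh is used here because L is symmetric whenever S is, which is a different route to the eigenvectors than a singular value decomposition.

```python
import numpy as np

def spectral_features(S, k):
    """Rows of the returned (n, k) array are the new feature vectors."""
    d = S.sum(axis=1)                      # row sums d_i
    D = np.diag(1.0 / np.sqrt(d))          # D^{-1/2}
    L = np.eye(len(S)) - D @ S @ D         # L = I - D^{-1/2} S D^{-1/2}
    w, v = np.linalg.eigh(L)               # eigenvalues in ascending order
    order = np.argsort(w)[::-1][:k]        # indices of the k largest eigenvalues
    return v[:, order]                     # eigenvectors as columns

# six 2D points in two groups; S holds pair-wise Euclidean distances
pts = np.array([[0., 0.], [0., 1.], [1., 0.], [8., 8.], [8., 9.], [9., 8.]])
S = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2))

feats = spectral_features(S, 2)
print(feats.shape)   # one row (new feature vector of length k) per element
```

The rows of feats can then be passed to any clustering method that works on vectors, such as k-means.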

Enough about the theory; let's see what it looks like in code when applied to a real
example. Again, we take the font images used in the k-means example above (and
introduced on page 14):

from scipy.cluster.vq import *

n = len(projected)

# compute distance matrix
S = array([[ sqrt(sum((projected[i]-projected[j])**2))
             for i in range(n) ] for j in range(n)], 'f')

# create Laplacian matrix
rowsum = sum(S,axis=0)
D = diag(1 / sqrt(rowsum))
I = identity(n)
L = I - dot(D,dot(S,D))

# compute eigenvectors of L
U,sigma,V = linalg.svd(L)

k = 5
# create feature vector from k first eigenvectors
# by stacking eigenvectors as columns
features = array(V[:k]).T

# k-means
features = whiten(features)
centroids,distortion = kmeans(features,k)
code,distance = vq(features,centroids)

# plot clusters
for c in range(k):
    ind = where(code==c)[0]
    figure()
    for i in range(minimum(len(ind),39)):
        im = Image.open(path+imlist[ind[i]])
        subplot(4,10,i+1)
        imshow(array(im))
        axis('equal')
        axis('off')
show()

* Sometimes $L = D^{-1/2} S D^{-1/2}$ is used as the Laplacian matrix instead, but the choice doesn't
really matter since it only changes the eigenvalues, not the eigenvectors.

In this case, we just create S using pair-wise Euclidean distances and compute a standard
k-means clustering on the k eigenvectors (k = 5 in this particular case). Remember
that the matrix V contains the eigenvectors sorted with respect to the eigenvalues.
Finally, the clusters are plotted. Figure 6-8 shows the clusters for an example run (remember
that the k-means step might give different results each run).

Figure 6-8. Spectral clustering of font images using the eigenvectors of the Laplacian matrix.

We can also try this on an example where we don't have any feature vectors or any
strict definition of similarity. The geotagged Panoramio images on page 44 were linked
based on how many matching local descriptors were found between them. The matrix
on page 48 is a similarity matrix with scores equal to the number of matching features
(without any normalization). With imlist containing the filenames of the images and
the similarity matrix saved to a file using NumPy's savetxt(), we only need to modify the
first rows of the code above to

n = len(imlist)

# load the similarity matrix and reformat
S = loadtxt('panoramio_matches.txt')
S = 1 / (S + 1e-6)

where we invert the scores to have low values for similar images (so we don’t have to 
modify the code above). We add a small number to avoid division by zero. The rest
of the code you can leave as is. 
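The effect of the inversion is easiest to see on a tiny matrix. The match counts below are made-up illustrative values, not the Panoramio data.

```python
import numpy as np

# hypothetical number of matching features between three images
S = np.array([[10., 3., 0.],
              [3., 8., 1.],
              [0., 1., 9.]])

# invert so that many matches give a low value (a small "distance")
D = 1 / (S + 1e-6)
print(D)
```

A pair with no matches ends up with a very large value (about one million here), so the size of the added constant decides how far apart unmatched images are placed.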

Choosing k is a bit tricky in this case. Most people would consider there to be only two
classes (the two sides of the White House) and then some junk images. With k = 2, 
you get something like Figure 6-9, with one large cluster of images of one side and the 
other cluster containing the other side plus all the junk images. Picking a larger value 
of k, like k = 10, gives several clusters with only one image (hopefully the junk images) 
and some real clusters. An example run is shown in Figure 6-10. In this case, there were 
only two actual clusters, each containing images of one side of the White House. 



Figure 6-9. Spectral clustering of geotagged images of the White House with k = 2 and the similarity
scores as the number of matching local descriptors.

Figure 6-10. Spectral clustering of geotagged images of the White House with k = 10 and the similarity
scores as the number of matching local descriptors. Only the clusters with more than one image are shown.


There are many different versions and alternatives to the algorithm presented here, each 
of them with its own idea of how to construct the matrix L and what to do with the
eigenvectors. For further reading on spectral clustering and the details of some common 
algorithms, see for example the review paper [37]. 


Exercises 

1. Hierarchical k-means is a clustering method that applies k-means recursively to the
clusters to create a tree of incrementally refined clusters. In this case, each node in
the tree will branch to k child nodes. Implement this and try it on the font images.

2. Using the hierarchical k-means from the previous exercise, make a tree visualization
(similar to the dendrogram for hierarchical clustering) that shows the average image 
for each cluster node. Tip: You can take the average PCA coefficients feature vector 
and use the PCA basis to synthesize an image for each feature vector. 

3. By modifying the class used for hierarchical clustering to include the number of 
images below the node, you have a simple and fast way of finding similar (tight) 
groups of a given size. Implement this small change and try it out on some real 
data. How does it perform? 

4. Experiment with using single and complete linking for building the hierarchical 
cluster tree. How do the resulting clusters differ? 

5. In some spectral clustering algorithms the matrix D^{-1}S is used instead of L. Try
replacing the Laplacian matrix with this and apply this on a few different data sets. 

6. Download some image collections from Flickr searches with different tags. Extract 
the RGB histogram like you did for the sunset images. Cluster the images using one 
of the methods from this chapter. Can you separate the classes with the clusters? 
CHAPTER 7


Searching Images 


This chapter shows how to use text mining techniques to search for images based on 
their visual content. The basic ideas of using visual words are presented and the details 
of a complete setup are explained and tested on an example image data set. 

7.1 Content-Based Image Retrieval 

Content-based image retrieval (CBIR) deals with the problem of retrieving visually 
similar images from a (large) database of images. This can be images with similar color, 
similar textures, or similar objects or scenes: basically any information contained in the
images themselves. 

For high-level queries, like finding similar objects, it is not feasible to do a full comparison
(for example using feature matching) between a query image and all images in
the database. It would simply take too much time to return any results if the database 
is large. In the last couple of years, researchers have successfully introduced techniques 
from the world of text mining for CBIR problems, making it possible to search millions 
of images for similar content. 

Inspiration from Text Mining: The Vector Space Model

The vector space model is a model for representing and searching text documents. As
we will see, it can be applied to essentially any kind of objects, including images. The
name comes from the fact that text documents are represented with vectors that are
histograms of the word frequencies in the text.¹ In other words, the vector will contain
the number of occurrences of every word (at the position corresponding to that word)
and zeros everywhere else. This model is also called a bag-of-words representation, since
the order and location of words are ignored.

¹ Often you see “term” used instead of “word”; the meaning is the same.
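
To make the histogram idea concrete, here is a minimal sketch of a bag-of-words vector for a single sentence. The four-word vocabulary, the document string, and the helper word_histogram() are all invented for this illustration (Python 3).

```python
# a tiny, invented vocabulary and document
vocabulary = ['image', 'search', 'python', 'vision']
document = 'python image search with python'

def word_histogram(text, vocabulary):
    """ Count occurrences of each vocabulary word in text. """
    words = text.lower().split()
    return [words.count(w) for w in vocabulary]

# one entry per vocabulary word, in vocabulary order
hist = word_histogram(document, vocabulary)
print(hist)
```

The word "with" is not in the vocabulary and is simply dropped, and swapping the order of the words in the document would give the same vector.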
Documents are indexed by doing a word count to construct the document histogram
vector v, usually with common words like “the,” “and,” “is,” etc., ignored. These
common words are called stop words. To compensate for document length, the vectors
can be normalized to unit length by dividing by the total histogram sum. The
individual components of the histogram vector are usually weighted according to the
importance of each word. Usually, the importance of a word increases proportionally
to how often it appears in the document, but decreases if the word is common in all
documents in a data set (or “corpus”).


The most common weighting is tf-idf weighting (term frequency-inverse document
frequency), where the term frequency of a word w in document d is

$$\mathrm{tf}_{w,d} = \frac{n_w}{\sum_j n_j},$$

where $n_w$ is the number of occurrences of w in d. To normalize, this is divided by the
total number of occurrences of all words in the document.


The inverse document frequency is

$$\mathrm{idf}_{w,d} = \log \frac{|D|}{|\{d : w \in d\}|},$$

where |D| is the number of documents in the corpus D and the denominator is the
number of documents d in D containing w. Multiplying the two gives the tf-idf weight,
which becomes one of the elements in v. You can read more about tf-idf at
http://en.wikipedia.org/wiki/Tf-idf.
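
The two definitions translate almost directly into code. The following is a minimal sketch on a toy corpus of three "documents"; the corpus and the function names tf(), idf() and tfidf() are made up for this example, and the code assumes Python 3 (true division).

```python
from math import log

# toy corpus of three "documents" (invented for illustration)
corpus = [['sky', 'sun', 'sky'],
          ['sun', 'sea'],
          ['sea', 'sea', 'sand']]

def tf(word, doc):
    """ Term frequency: occurrences of word divided by
        the total number of words in the document. """
    return doc.count(word) / len(doc)

def idf(word, corpus):
    """ Inverse document frequency: log of the corpus size over
        the number of documents containing word. """
    nbr_containing = sum(1 for doc in corpus if word in doc)
    return log(len(corpus) / nbr_containing)

def tfidf(word, doc, corpus):
    return tf(word, doc) * idf(word, corpus)

print(tfidf('sky', corpus[0], corpus))  # frequent here, rare elsewhere
print(tfidf('sun', corpus[0], corpus))  # less frequent here, more common elsewhere
```

In the first document "sky" gets a higher weight than "sun", both because it occurs more often there and because it appears in fewer documents overall. Note that idf() as written fails for a word that occurs in no document at all.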


This is really all we need at the moment. Let's see how to carry this model over to
indexing and searching images based on their visual content. 


7.2 Visual Words 

To apply text mining techniques to images, we first need to create the visual equivalent 
of words. This is usually done using local descriptors like the SIFT descriptor in 
Section 2.2. The idea is to quantize the descriptor space into a number of typical 
examples and assign each descriptor in the image to one of those examples. These typical 
examples are determined by analyzing a training set of images and can be considered as 
visual words. The set of all words is then a visual vocabulary (sometimes called a visual 
codebook). This vocabulary can be created specifically for a given problem or type of 
image or just to try to represent visual content in general. 

The visual words are constructed using some clustering algorithm applied to the feature
descriptors extracted from a (large) training set of images. The most common choice is
k-means,² which is what we will use here. Visual words are nothing but a collection of
vectors in the given feature descriptor space; in the case of k-means, they are the cluster
centroids. Representing an image with a histogram of visual words is then called a
bag-of-visual-words model.

² Or in the more advanced cases, hierarchical k-means.
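
The quantization step can be tried out without any images. The sketch below uses kmeans() and vq() from scipy.cluster.vq on random two-dimensional points that stand in for descriptors; the data, the choice of two words, and the variable names are assumptions made for the illustration (real SIFT descriptors are 128-dimensional). It is written for Python 3.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

np.random.seed(0)

# stand-in "descriptors": two well-separated groups of 2D points
training = np.vstack((np.random.randn(100, 2),
                      np.random.randn(100, 2) + 10))

# the visual words are the cluster centroids
words, distortion = kmeans(training, 2)

# quantize the descriptors of a new "image" and build its word histogram
descriptors = np.random.randn(30, 2) + 10
codes, dist = vq(descriptors, words)
hist = np.bincount(codes, minlength=words.shape[0])
print(hist)
```

All 30 descriptors of the new "image" lie close to the second group, so they fall into a single bin of the histogram.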
Let's introduce an example data set and use that to illustrate the concept. The file
first1000.zip contains the first 1000 images from the University of Kentucky object
recognition benchmark set (also known as “ukbench”). The full set, reported benchmarks,
and some supporting code can be found at http://www.vis.uky.edu/~stewe/ukbench/.
The ukbench set contains many sets of four images, each of the same scene
or object (stored consecutively so that 0 ... 3, 4 ... 7, etc., belong together). Figure 7-1
shows some examples from the data set. Appendix A has the details on the set and how
to get the data.







Figure 7-1. Some examples of images from the ukbench (University of Kentucky object recognition 
benchmark) data set. 
Creating a Vocabulary

To create a vocabulary of visual words we first need to extract descriptors. Here we 
will use the SIFT descriptor. Running the following lines of code, with imlist, as usual, 
containing the filenames of the images, 

nbr_images = len(imlist)
featlist = [ imlist[i][:-3]+'sift' for i in range(nbr_images)]

for i in range(nbr_images):
    sift.process_image(imlist[i],featlist[i])

will give you descriptor files for each image. Create a file vocabulary.py and add the 
following code for a vocabulary class and a method for training a vocabulary on some 
training image data: 

from numpy import *
from scipy.cluster.vq import *
import sift


class Vocabulary(object):

    def __init__(self,name):
        self.name = name
        self.voc = []
        self.idf = []
        self.trainingdata = []
        self.nbr_words = 0

    def train(self,featurefiles,k=100,subsampling=10):
        """ Train a vocabulary from features in files listed
            in featurefiles using k-means with k number of words.
            Subsampling of training data can be used for speedup. """

        nbr_images = len(featurefiles)
        # read the features from file
        descr = []
        descr.append(sift.read_features_from_file(featurefiles[0])[1])
        descriptors = descr[0] # stack all features for k-means
        for i in arange(1,nbr_images):
            descr.append(sift.read_features_from_file(featurefiles[i])[1])
            descriptors = vstack((descriptors,descr[i]))

        # k-means: last number determines number of runs
        self.voc,distortion = kmeans(descriptors[::subsampling,:],k,1)
        self.nbr_words = self.voc.shape[0]

        # go through all training images and project on vocabulary
        imwords = zeros((nbr_images,self.nbr_words))
        for i in range( nbr_images ):
            imwords[i] = self.project(descr[i])

        nbr_occurences = sum( (imwords > 0)*1 ,axis=0)

        self.idf = log( (1.0*nbr_images) / (1.0*nbr_occurences+1) )
        self.trainingdata = featurefiles

    def project(self,descriptors):
        """ Project descriptors on the vocabulary
            to create a histogram of words. """

        # histogram of image words
        imhist = zeros((self.nbr_words))
        words,distance = vq(descriptors,self.voc)
        for w in words:
            imhist[w] += 1

        return imhist

The class Vocabulary contains a vector of word cluster centers voc together with the
idf values for each word. To train the vocabulary on some set of images, the method
train() takes a list of .sift descriptor files and k, the desired number of words for the
vocabulary. There is also an option of subsampling the training data for the k-means
step, which will take a long time if too many features are used.
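
The idf values computed at the end of train() are easy to check on a small matrix of word counts. The sketch below repeats that computation on invented numbers, with one row per image and one column per visual word (Python 3, NumPy).

```python
import numpy as np

# rows are images, columns are visual words (invented counts)
imwords = np.array([[3, 0, 1],
                    [1, 0, 0],
                    [2, 5, 0],
                    [4, 0, 0]])
nbr_images = imwords.shape[0]

# number of images in which each word occurs at least once
nbr_occurences = np.sum(imwords > 0, axis=0)

# same formula as in train(); the +1 avoids division by zero
idf = np.log(nbr_images / (nbr_occurences + 1.0))
print(nbr_occurences)
print(idf)
```

The first word occurs in all four images and gets the lowest weight; because of the +1 in the denominator its weight is even slightly below zero. The other two words each occur in a single image and get the same, higher weight.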

With the images and the feature files in a folder on your computer, the following code 
will create a vocabulary of length k ≈ 1000 (again assuming that imlist contains a list
of filenames for the images): 

import pickle
import vocabulary

nbr_images = len(imlist)
featlist = [ imlist[i][:-3]+'sift' for i in range(nbr_images) ]

voc = vocabulary.Vocabulary('ukbenchtest')
voc.train(featlist,1000,10)

# saving vocabulary
with open('vocabulary.pkl', 'wb') as f:
    pickle.dump(voc,f)
print 'vocabulary is:', voc.name, voc.nbr_words

The last part also saves the entire vocabulary object for later use using the pickle 
module. 

7.3 Indexing Images 

Before we can start searching, we need to set up a database with the images and their 
visual word representations. 


Setting Up the Database

To start indexing images, we first need to set up a database. Indexing images in this
context means extracting descriptors from the images, converting them to visual words
using a vocabulary, and storing the visual words and word histograms with information
about which image they belong to. This will make it possible to query the database
using an image and get the most similar images back as search result.
Table 7-1. A simple database schema for storing images and visual words.

imlist      imwords     imhistograms
rowid       imid        imid
filename    wordid      histogram
            vocname     vocname

Here we will use SQLite as the database. SQLite is a database that stores everything in a
single file and is very easy to set up and use. We are using it here since it is the easiest way
to get started without having to go into database and server configurations and other
details way outside the scope of this book. The Python version, pysqlite, is available
from http://code.google.com/p/pysqlite/ and also through many package repositories on
Mac and Linux. SQLite uses the SQL query language, so the transition should be easy
if you want to use another database.

To get started, we need to create tables and indexes and an indexer class to write image
data to the database. First, create a file imagesearch.py and add the following code:

import pickle
from pysqlite2 import dbapi2 as sqlite

class Indexer(object):

    def __init__(self,db,voc):
        """ Initialize with the name of the database
            and a vocabulary object. """

        self.con = sqlite.connect(db)
        self.voc = voc

    def __del__(self):
        self.con.close()

    def db_commit(self):
        self.con.commit()

First of all, we need pickle for encoding and decoding NumPy arrays to and from strings.
SQLite is imported from the pysqlite2 module (see Appendix A for installation details).
The Indexer class connects to a database and stores a vocabulary object upon creation
(where the __init__() method is called). The __del__() method makes sure to close the
database connection and db_commit() writes the changes to the database file.

We only need a very simple database schema of three tables. The table imlist contains
the filenames of all indexed images, and imwords contains a word index of the
words, which vocabulary was used, and which images the words appear in. Finally,
imhistograms contains the full word histograms for each image. We need those to compare
images according to our vector space model. The schema is shown in Table 7-1.

The following method for the Indexer class creates the tables and some useful indexes 
to make searching faster: 
    def create_tables(self):
        """ Create the database tables. """

        self.con.execute('create table imlist(filename)')
        self.con.execute('create table imwords(imid,wordid,vocname)')
        self.con.execute('create table imhistograms(imid,histogram,vocname)')
        self.con.execute('create index im_idx on imlist(filename)')
        self.con.execute('create index wordid_idx on imwords(wordid)')
        self.con.execute('create index imid_idx on imwords(imid)')
        self.con.execute('create index imidhist_idx on imhistograms(imid)')
        self.db_commit()
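
If you want to experiment with the schema before touching any image data, the statements can be run against a throwaway in-memory database. This sketch assumes Python 3 and the standard library sqlite3 module as a stand-in for pysqlite2; only one of the four indexes is created to keep it short.

```python
import sqlite3

# in-memory database, so nothing is written to disk
con = sqlite3.connect(':memory:')
con.execute('create table imlist(filename)')
con.execute('create table imwords(imid,wordid,vocname)')
con.execute('create table imhistograms(imid,histogram,vocname)')
con.execute('create index im_idx on imlist(filename)')
con.commit()

# sqlite_master lists every table and index in the database
rows = con.execute(
    "select type, name from sqlite_master order by name").fetchall()
print(rows)
```

The rowid column of Table 7-1 does not appear in the create statements because SQLite adds it to ordinary tables automatically.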


Adding Images 

With the database tables in place, we can start adding images to the index. To do this, we 
need a method add_to_index() for our Indexer class. Add this method to imagesearch.py. 

    def add_to_index(self,imname,descr):
        """ Take an image with feature descriptors,
            project on vocabulary and add to database. """

        if self.is_indexed(imname): return
        print 'indexing', imname

        # get the imid
        imid = self.get_id(imname)

        # get the words
        imwords = self.voc.project(descr)
        nbr_words = imwords.shape[0]

        # link each word to image
        for i in range(nbr_words):
            word = imwords[i]
            # wordid is the word number itself
            self.con.execute(
                "insert into imwords(imid,wordid,vocname) values (?,?,?)",
                (imid,word,self.voc.name))

        # store word histogram for image
        # use pickle to encode NumPy arrays as strings
        self.con.execute(
            "insert into imhistograms(imid,histogram,vocname) values (?,?,?)",
            (imid,pickle.dumps(imwords),self.voc.name))

This method takes the image filename and a NumPy array with the descriptors found 
in the image. The descriptors are projected on the vocabulary and inserted in imwords 
(word by word) and imhistograms. We used two helper functions, is_indexed(), which 
checks if the image has been indexed already and get_id(), which gives the image id 
for an image filename. Add these to imagesearch.py: 

    def is_indexed(self,imname):
        """ Returns True if imname has been indexed. """

        im = self.con.execute(
            "select rowid from imlist where filename='%s'" % imname).fetchone()
        return im != None
    def get_id(self,imname):
        """ Get an entry id and add if not present. """

        cur = self.con.execute(
            "select rowid from imlist where filename='%s'" % imname)
        res=cur.fetchone()
        if res==None:
            cur = self.con.execute(
                "insert into imlist(filename) values ('%s')" % imname)
            return cur.lastrowid
        else:
            return res[0]
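
The two helper methods build their SQL with string formatting, which breaks if a filename contains a single quote. One possible alternative, sketched here, is to pass the values through ? placeholders in the same way add_to_index() does. The sketch assumes Python 3 and the standard library sqlite3 module instead of pysqlite2, and the standalone function get_id(con, imname) is a hypothetical variant written for this example, not part of imagesearch.py.

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('create table imlist(filename)')

def get_id(con, imname):
    """ Return the rowid for imname, inserting it if not present. """
    res = con.execute(
        "select rowid from imlist where filename=?", (imname,)).fetchone()
    if res is None:
        cur = con.execute(
            "insert into imlist(filename) values (?)", (imname,))
        return cur.lastrowid
    return res[0]

first = get_id(con, "o'hare.jpg")    # a quote in the name is handled
second = get_id(con, "o'hare.jpg")   # second call finds the existing row
print(first, second)
```

Both calls return the same id, since the second one finds the row inserted by the first.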

Did you notice that we used pickle in add_to_index()? Databases like SQLite do not
have a standard type for storing objects or arrays. Instead, we can create a string
representation using pickle's dumps() function and write the string to the database.
Consequently, we need to un-pickle the string when reading from the database. More
on that in the next section.
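
The round trip through the database can be tested in isolation. The following sketch stores one invented histogram and reads it back; it assumes Python 3 and the standard library sqlite3 module, where dumps() returns a bytes object that SQLite stores as a blob.

```python
import pickle
import sqlite3
import numpy as np

con = sqlite3.connect(':memory:')
con.execute('create table imhistograms(imid,histogram,vocname)')

hist = np.array([0.0, 2.0, 1.0, 0.0])   # invented word histogram

# encode the array with pickle and store it
con.execute(
    "insert into imhistograms(imid,histogram,vocname) values (?,?,?)",
    (1, pickle.dumps(hist), 'toy'))

# read it back and decode
s = con.execute(
    "select histogram from imhistograms where imid=?", (1,)).fetchone()
restored = pickle.loads(s[0])
print(restored)
```

Keep in mind that un-pickling executes instructions stored in the data, so it should only be used on databases you created yourself.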

The following code example will go through the ukbench sample images and add them 
to our index. Here we assume that the lists imlist and featlist contain the filenames of the 
images and the descriptor files and that the vocabulary you trained earlier was pickled 
to a file vocabulary.pkl. 

import pickle
import sift
import imagesearch

nbr_images = len(imlist)

# load vocabulary
with open('vocabulary.pkl', 'rb') as f:
    voc = pickle.load(f)

# create indexer
indx = imagesearch.Indexer('test.db',voc)
indx.create_tables()

# go through all images, project features on vocabulary and insert
for i in range(nbr_images)[:1000]:
    locs,descr = sift.read_features_from_file(featlist[i])
    indx.add_to_index(imlist[i],descr)

# commit to database
indx.db_commit()

We can now inspect the contents of our database: 

from pysqlite2 import dbapi2 as sqlite 
con = sqlite.connect('test.db') 

print con.execute('select count(filename) from imlist').fetchone()
print con.execute('select * from imlist').fetchone() 

This prints the following to your console: 

(1000,)
(u'ukbench00000.jpg',)
If you try fetchall() instead of fetchone() in the last line, you will get a long list of all
the filenames. 

7.4 Searching the Database for Images 

With a set of images indexed, we can search the database for similar images. Here 
we have used a bag-of-word representation for the whole image, but the procedure 
explained here is generic and can be used to find similar objects, similar faces, similar 
colors, etc. It all depends on the images and descriptors used. 

To handle searches, we introduce a Searcher class to imagesearch.py.

class Searcher(object):

    def __init__(self,db,voc):
        """ Initialize with the name of the database. """
        self.con = sqlite.connect(db)
        self.voc = voc

    def __del__(self):
        self.con.close()

A new Searcher object connects to the database and closes the connection upon deletion,
same as for the Indexer class before.

If the number of images is large, it is not feasible to do a full histogram comparison
across all images in the database. We need a way to find a reasonably sized set of
candidates (where “reasonable” can be determined by search response time, memory
requirements, etc.). This is where the word index comes into play. Using the index, we
can get a set of candidates and then do the full comparison against that set.

Using the Index to Get Candidates 

We can use our index to find all images that contain a particular word. This is just a 
simple query to the database. Add candidates_from_word() as a method for the Searcher 
class: 

    def candidates_from_word(self,imword):
        """ Get list of images containing imword. """

        im_ids = self.con.execute(
            "select distinct imid from imwords where wordid=%d" % imword).fetchall()
        return [i[0] for i in im_ids]

This gives the image ids for all images containing the word. To get candidates for more
than one word, for example all the nonzero entries in a word histogram, we can loop
over each word, get images with that word, and aggregate the lists.³ Here we should
also keep track of how many times each image id appears in the aggregate list, since
this shows how many words match the ones in the word histogram. This can be done
with the following Searcher method:

³ If you don’t want to use all words, try ranking them according to their idf weight and use the ones with highest
weights.


    def candidates_from_histogram(self,imwords):
        """ Get list of images with similar words. """

        # get the word ids
        words = imwords.nonzero()[0]

        # find candidates
        candidates = []
        for word in words:
            c = self.candidates_from_word(word)
            candidates+=c

        # take all unique words and reverse sort on occurrence
        tmp = [(w,candidates.count(w)) for w in set(candidates)]
        tmp.sort(cmp=lambda x,y:cmp(x[1],y[1]))
        tmp.reverse()

        # return sorted list, best matches first
        return [w[0] for w in tmp]


This method creates a list of word ids from the nonzero entries in a word histogram of
an image. Candidates for each word are retrieved and aggregated in the list candidates.
Then we create a list of tuples (image id, count) with the number of occurrences of
each image id in the candidate list and sort this list (in place for efficiency) using sort()
with a custom comparison function that compares the second element in the tuple.
The comparison function is declared inline using lambda functions, convenient
one-line function declarations. The result is returned as a list of image ids with the best
matching image first.
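
The counting and sorting step can also be written with collections.Counter from the standard library, which counts all ids in one pass instead of calling count() once per id. This is a sketch of an alternative, not the method used in imagesearch.py; the candidate list is invented, and the code is written for Python 3, where the cmp argument of sort() is no longer available.

```python
from collections import Counter

# invented candidate list: image ids gathered word by word
candidates = [3, 7, 3, 9, 7, 3, 12]

# count how often each image id occurs, most frequent first
counts = Counter(candidates)
ranked = [imid for imid, count in counts.most_common()]

print(ranked[0], counts[ranked[0]])   # best candidate and its count
```

Image 3 shares three words with the query and comes first, followed by image 7 with two.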

Consider the following example: 


src = imagesearch.Searcher('test.db',voc)
locs,descr = sift.read_features_from_file(featlist[0])
iw = voc.project(descr)

print 'ask using a histogram...'
print src.candidates_from_histogram(iw)[:10]

This prints the first 10 lookups from the index and gives the output (this will vary 
depending on your vocabulary): 


ask using a histogram... 

[655, 656, 654, 44, 9, 653, 42, 43, 41, 12] 

None of the top 10 candidates are correct. Don’t worry; we can now take any number of
elements from this list and compare histograms. As you will see, this improves things
considerably.
Querying with an Image

There is not much more needed to do a full search using an image as query. To do word
histogram comparisons, a Searcher object needs to be able to read the image word
histograms from the database. Add this method to the Searcher class:

    def get_imhistogram(self,imname):
        """ Return the word histogram for an image. """

        im_id = self.con.execute(
            "select rowid from imlist where filename='%s'" % imname).fetchone()
        s = self.con.execute(
            "select histogram from imhistograms where rowid='%d'" % im_id).fetchone()

        # use pickle to decode NumPy arrays from string
        return pickle.loads(str(s[0]))

Again we use pickle to convert between string and NumPy arrays, this time with loads().

Now we can combine everything into a query method: 

    def query(self,imname):
        """ Find a list of matching images for imname. """

        h = self.get_imhistogram(imname)
        candidates = self.candidates_from_histogram(h)

        matchscores = []
        for imid in candidates:
            # get the name
            cand_name = self.con.execute(
                "select filename from imlist where rowid=%d" % imid).fetchone()
            cand_h = self.get_imhistogram(cand_name)
            cand_dist = sqrt( sum( (h-cand_h)**2 ) ) # use L2 distance
            matchscores.append( (cand_dist,imid) )

        # return a sorted list of distances and database ids
        matchscores.sort()
        return matchscores

This Searcher method takes the filename of an image and retrieves the word histogram
and a list of candidates (which should be limited to some maximum number if you have
a large data set). For each candidate, we compare histograms using standard Euclidean
distance and return a sorted list of tuples containing distance and image id.
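
The ranking part of query() does not depend on the database and can be illustrated on its own. In this sketch, the query histogram and the three candidate histograms are invented and kept in a dictionary keyed by image id (Python 3, NumPy).

```python
import numpy as np

# invented word histograms: the query and three candidates
h = np.array([2.0, 0.0, 1.0, 3.0])
candidates = {1: np.array([2.0, 0.0, 1.0, 3.0]),
              2: np.array([0.0, 4.0, 0.0, 1.0]),
              3: np.array([2.0, 1.0, 1.0, 2.0])}

matchscores = []
for imid, cand_h in candidates.items():
    cand_dist = np.sqrt(np.sum((h - cand_h)**2))   # L2 distance
    matchscores.append((cand_dist, imid))

# tuples sort on their first element, so the closest image comes first
matchscores.sort()
print([imid for dist, imid in matchscores])
```

Candidate 1 is identical to the query and gets distance zero, candidate 3 differs in two bins, and candidate 2 is furthest away.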

Let's try a query for the same image as in the previous section:

src = imagesearch.Searcher('test.db',voc)
print 'try a query...'
print src.query(imlist[0])[:10]

This will again print the top 10 results, including the distance, and should look something
like this:

try a query...
[(0.0, 1), (100.03999200319841, 2), (105.45141061171255, 3), (129.47200469599596, 708),
(129.73819792181484, 707), (132.68006632497588, 4), (139.89639023220005, 10),
(142.31654858097141, 706), (148.1924424523734, 716), (148.22955170950223, 663)]
Much better. The image has distance zero to itself, and two out of the three images of the
same scene are on the first two positions. The third image is coming in on position five. 

Benchmarking and Plotting the Results 

To get a feel for how good the search results are, we can compute the number of correct 
images on the top four positions. This is the measure used to report performance for the 
ukbench image set. Here's a function that computes this score. Add it to imagesearch.py
and you can start optimizing your queries: 

def compute_ukbench_score(src,imlist):
    """ Returns the average number of correct
        images on the top four results of queries. """

    nbr_images = len(imlist)
    pos = zeros((nbr_images,4))
    # get first four results for each image
    for i in range(nbr_images):
        pos[i] = [w[1]-1 for w in src.query(imlist[i])[:4]]

    # compute score and return average
    score = array([ (pos[i]//4)==(i//4) for i in range(nbr_images)])*1.0
    return sum(score) / (nbr_images)

This function gets the top four results and subtracts one from the index returned by
query() since the database index starts at one and the list of images at zero. Then we
compute the score using integer division, using the fact that the correct images are
consecutive in groups of four. A perfect result gives a score of 4, nothing right gives
a score of 0, and only retrieving the identical image gives a score of 1. Finding the
identical image together with two of the three other images gives a score of 3.
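
To see how the integer division picks out the correct group, consider the following sketch. It applies the same scoring expression to invented top-four lists for the first group of four images (indices 0 to 3), so no database is needed (Python 3, NumPy).

```python
import numpy as np

# invented top-four results (zero-based image indices) for images 0 to 3
pos = np.array([[0, 1, 2, 3],     # all four correct
                [1, 0, 9, 3],     # three correct
                [2, 40, 41, 8],   # only the image itself
                [3, 2, 1, 17]])   # three correct

nbr_images = pos.shape[0]

# integer division by 4 maps an image index to its group number
score = np.array([(pos[i]//4) == (i//4) for i in range(nbr_images)])*1.0
print(np.sum(score) / nbr_images)
```

The four queries have 4, 3, 1, and 3 correct results, so the average score is 11/4 = 2.75.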

Try it out like this: 

imagesearch.compute_ukbench_score(src,imlist) 

Or if you don’t want to wait (it will take some time to do 1000 queries), just use a subset 
of the images: 

imagesearch.compute_ukbench_score(src,imlist[:100])

We can consider a score close to 3 as pretty good in this case. The state-of-the-art results 
reported on the ukbench website are just over 3 (note that they are using more images 
and your score will decrease with a larger set). 

Finally, a function for showing the actual search results will be useful. Add this function,

def plot_results(src,res):
    """ Show images in result list 'res'. """

    figure()
    nbr_results = len(res)
    for i in range(nbr_results):
        imname = src.get_filename(res[i])
        subplot(1,nbr_results,i+1)
        imshow(array(Image.open(imname)))
        axis('off')
    show()

which can be called with any number of search results in the list res. For example, 
like this: 

nbr_results = 6 

res = [w[1] for w in src.query(imlist[0])[:nbr_results]]
imagesearch.plot_results(src,res) 

The helper function 

    def get_filename(self,imid):
        """ Return the filename for an image id. """

        s = self.con.execute(
            "select filename from imlist where rowid='%d'" % imid).fetchone()
        return s[0]

translates image id to filenames that we need for loading the images when plotting. 
Some example queries on our data set are shown using plot_results() in Figure 7-2. 



Figure 7-2. Some example search results on the ukbench data set. The query image is shown on the 
far left followed by the top five retrieved images.
7.5 Ranking Results Using Geometry

Let's briefly look at a common way of improving results obtained using a bag-of-visual-words
model. One of the drawbacks of the model is that the visual words representation
of an image does not contain the positions of the image features. This was the price paid
to get speed and scalability.

One way to have the feature points improve results is to re-rank the top results using
some criterion that takes the features’ geometric relationships into account. The most
common approach is to fit homographies between the feature locations in the query
image and the top result images.

To make this efficient, the feature locations can be stored in the database and correspondences
determined by the word id of the features (this only works if the vocabulary is
large enough so that the word id matches contain mostly correct matches). This would
require a major rewrite of our database and code above and complicate the presentation.
To illustrate, we will just reload the features for the top images and match them.

Here is what a complete example of loading all the model files and re-ranking the top 
results using homographies looks like: 


import pickle
import sift
import imagesearch
import homography

# load image list and vocabulary
with open('ukbench_imlist.pkl','rb') as f:
    imlist = pickle.load(f)
    featlist = pickle.load(f)

nbr_images = len(imlist)

with open('vocabulary.pkl', 'rb') as f:
    voc = pickle.load(f)

src = imagesearch.Searcher('test.db',voc)

# index of query image and number of results to return
q_ind = 50
nbr_results = 20

# regular query
res_reg = [w[1] for w in src.query(imlist[q_ind])[:nbr_results]]
print 'top matches (regular):', res_reg

# load image features for query image
q_locs,q_descr = sift.read_features_from_file(featlist[q_ind])
fp = homography.make_homog(q_locs[:,:2].T)

# RANSAC model for homography fitting
model = homography.RansacModel()

rank = {}
# load image features for result
for ndx in res_reg[1:]:
    locs,descr = sift.read_features_from_file(featlist[ndx])

    # get matches
    matches = sift.match(q_descr,descr)
    ind = matches.nonzero()[0]
    ind2 = matches[ind]
    tp = homography.make_homog(locs[:,:2].T)

    # compute homography, count inliers. if not enough matches return empty list
    try:
        H,inliers = homography.H_from_ransac(fp[:,ind],tp[:,ind2],model,match_theshold=4)
    except:
        inliers = []

    # store inlier count
    rank[ndx] = len(inliers)

# sort dictionary to get the most inliers first
sorted_rank = sorted(rank.items(), key=lambda t: t[1], reverse=True)
res_geom = [res_reg[0]]+[s[0] for s in sorted_rank]
print 'top matches (homography):', res_geom

# plot the top results
imagesearch.plot_results(src,res_reg[:8])
imagesearch.plot_results(src,res_geom[:8])


First, the image list, feature list (containing the filenames of the images and SIFT feature
files, respectively), and the vocabulary are loaded. Then a Searcher object is created and
a regular query is performed and stored in the list res_reg. The features for the query
image are loaded. Then for each image in the result list, the features are loaded and
matched against the query image. Homographies are computed from the matches and
the number of inliers counted. If the homography fitting fails, we set the inlier list
to an empty list. Finally, we sort the dictionary rank that contains image index and
inlier count according to decreasing number of inliers. The result lists are printed to
the console and the top images visualized.
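
The final sorting step is independent of the feature matching and can be tried with made-up numbers. In the sketch below, the result list and the inlier counts are invented; the first entry of res_reg plays the role of the query image itself (Python 3).

```python
# invented result list and inlier counts; the first entry is the query
res_reg = [39, 22, 74, 38, 37]
rank = {22: 0, 74: 5, 38: 41, 37: 28}

# most inliers first; the query image keeps the top position
sorted_rank = sorted(rank.items(), key=lambda t: t[1], reverse=True)
res_geom = [res_reg[0]] + [s[0] for s in sorted_rank]
print(res_geom)
```

Image 22, which was ranked second by the histogram distance but has no inliers, drops to the end of the list, while images 38 and 37 move up.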

The output looks like this: 


top matches (regular): [39, 22, 74, 82, 50, 37, 38, 17, 29, 68, 52, 91, 15, 90, 31, ... ] 
top matches (homography): [39, 38, 37, 45, 67, 68, 74, 82, 15, 17, 50, 52, 85, 22, 87, ... ] 


Figure 7-3 shows some sample results with the regular and re-ranked top images. 
Figure 7-3. Some example search results with re-ranking based on geometric consistency using
homographies. For each example, the top row is the regular result and the bottom row the re-ranked
result.


7.6 Building Demos and Web Applications 

In this last section on searching, we’ll take a look at a simple way of building demos and 
web applications with Python. By making demos as web pages, you automatically get 
cross-platform support and an easy way to show and share your project with minimal 
requirements. In the sections below we will go through an example of a simple image 
search engine. 

Creating Web Applications with CherryPy 

To build these demos, we will use the CherryPy package, available at
http://www.cherrypy.org/. CherryPy is a pure Python lightweight web server that uses an
object-oriented model. See Appendix A for more details on how to install and configure
CherryPy. Assuming that you have studied the tutorial examples enough to have an
initial idea of how CherryPy works, let's build an image search web demo on top of the
image searcher you created earlier in this chapter.

Image Search Demo 

First, we need to initialize with a few html tags and load the data using pickle. We need
the vocabulary for the Searcher object that interfaces with the database. Create a file 
searchdemo.py and add the following class with two methods: 
import cherrypy, os, urllib, pickle, random
import imagesearch

class SearchDemo(object):

    def __init__(self):
        # load list of images
        with open('webimlist.txt') as f:
            self.imlist = f.readlines()

        self.nbr_images = len(self.imlist)
        self.ndx = range(self.nbr_images)

        # load vocabulary
        with open('vocabulary.pkl', 'rb') as f:
            self.voc = pickle.load(f)

        # set max number of results to show
        self.maxres = 15

        # header and footer html
        self.header = """
            <!doctype html>
            <head>
            <title>Image search example</title>
            </head>
            <body>
            """
        self.footer = """
            </body>
            </html>
            """

    def index(self,query=None):
        self.src = imagesearch.Searcher('web.db',self.voc)

        html = self.header
        html += """
            <br />
            Click an image to search. <a href='?query='>Random selection</a> of images.
            <br /><br />
            """
        if query:
            # query the database and get top images
            res = self.src.query(query)[:self.maxres]
            for dist,ndx in res:
                imname = self.src.get_filename(ndx)
                html += "<a href='?query="+imname+"'>"
                html += "<img src='"+imname+"' width='100' />"
                html += "</a>"
        else:
            # Show random selection if no query
            random.shuffle(self.ndx)
            for i in self.ndx[:self.maxres]:
                imname = self.imlist[i]
                html += "<a href='?query="+imname+"'>"
                html += "<img src='"+imname+"' width='100' />"
                html += "</a>"

        html += self.footer
        return html

    index.exposed = True

cherrypy.quickstart(SearchDemo(), '/',
        config=os.path.join(os.path.dirname(__file__), 'service.conf'))

As you can see, this simple demo consists of a single class with one method for 
initialization and one for the “index” page (the only page in this case). Methods are 
automatically mapped to URLs, and arguments to the methods can be passed directly 
in the URL. The index method has a query parameter which, in this case, is the query 
image to sort the others against. If it is empty a random selection of images is shown 
instead. The line 

index.exposed = True 

makes the index URL accessible and the last line starts the CherryPy web server with 
configurations read from service.conf. Our configuration file for this example has the 
following lines: 

[global]
server.socket_host = "127.0.0.1"
server.socket_port = 8080
server.thread_pool = 50
tools.sessions.on = True

[/]
tools.staticdir.root = "tmp/"
tools.staticdir.on = True
tools.staticdir.dir = ""


The first part specifies which IP address and port to use. The second part enables a local 
folder for reading (in this case “tmp/”). This should be set to the folder containing your 
images. 





Don't put anything secret in that folder if you plan to show this to people. 
The content of the folder will be accessible through CherryPy. 


Start your web server with 
$ python searchdemo.py 

from the command line. Opening your browser and pointing it at the right URL (in 
this case http://127.0.0.1:8080/) should show the initial screen with a random selection 
of images. This should look like the top image in Figure 7-4. Clicking an image starts 
a query and shows the top results. Clicking an image in the results starts a new query 
with that image, and so on. There is a link to get back to the starting state of a random 
selection (by passing an empty query). Some examples are shown in Figure 7-4. 
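
The demo pastes file names straight into the generated HTML. As a minimal sketch of a more defensive variant, the helper below strips the newline that readlines() leaves on each name and percent-encodes the name before it goes into the link. The function image_link() and the example path are hypothetical and not part of searchdemo.py, and the sketch assumes that the web server decodes the query parameter again before it reaches index().

```python
try:
    from urllib.parse import quote      # Python 3
except ImportError:
    from urllib import quote            # Python 2

def image_link(imname, width=100):
    """Return an <a><img></a> snippet that starts a new query for imname."""
    name = imname.strip()               # drop the newline left by readlines()
    url = quote(name)                   # percent-encode spaces, quotes, etc.
    return ("<a href='?query=" + url + "'>"
            "<img src='" + url + "' width='" + str(width) + "' /></a>")

# hypothetical file name with a space and a trailing newline
print(image_link("ukbench/full/my image.jpg\n"))
```

Encoding the name also keeps a stray quote character in a file name from breaking the surrounding HTML attribute.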

Figure 7-4. Some example search results on the ukbench data set: the starting page, which shows a 
random selection of the images (top); sample queries (bottom). The query image is shown in the top-left 
corner followed by the top image results shown row-wise. 


This example shows a full integration from web page to database queries and presentation 
of results. Naturally, we kept the styling and options to a minimum and there are 
many possibilities for improvement. For example, you may add a stylesheet to make it 
prettier or upload files to use as queries. 


Exercises 

1. Try to speed up queries by only using part of the words in the query image to 
construct the list of candidates. Use the idf weight as a criterion for which words 
to use. 

2. Implement a visual stop word list of the most common visual words in your 
vocabulary (say the top 10%) and ignore these words. How does this change the 
search quality? 

3. Visualize a visual word by saving all image features that are mapped to a given word 
id. Crop image patches around the feature locations (at the given scale) and plot 
them in a figure. Do the patches for a given word look the same? 

4. Experiment with using different distance measures and weighting in the query() 
method. Use the score from compute_ukbench_score() to measure your progress. 

5. Throughout this chapter, we only used SIFT features in our vocabulary. This 
completely disregards the color information, as you can see in the example results 
in Figure 7-2. Try to add color descriptors and see if you can improve the search 
results. 

6. For large vocabularies, using arrays to represent the visual word frequencies is 
inefficient, since most of the entries will be zero (think of the case with a few 
hundred thousand words and images with typically a thousand features). One way 
to overcome this inefficiency is to use dictionaries as sparse array representations. 
Replace the arrays with a sparse class of your own and add the necessary methods. 
Alternatively, try to use the scipy.sparse module. 

7. As you try to increase the size of the vocabulary, the clustering time will take 
too long and the projection of features to words also becomes slower. Implement 
a hierarchical vocabulary using hierarchical k-means clustering and see how this 
improves scaling. See the paper [23] for details and inspiration. 

CHAPTER 8 


Classifying Image Content 


This chapter introduces algorithms for classifying images and image content. We look 
at some simple but effective methods as well as state-of-the-art classifiers and apply 
them to two-class and multi-class problems. We show examples with applications in 
gesture recognition and object recognition. 

8.1 K-Nearest Neighbors 

One of the simplest and most used methods for classification is the k-nearest neighbor 
classifier (kNN). The algorithm simply compares an object (for example a feature vector) 
to be classified with all objects in a training set with known class labels and lets the 
k nearest vote for which class to assign. This method often performs well but has a 
number of drawbacks. As with the k-means clustering algorithm, the number k 
needs to be chosen and the value will affect performance. Furthermore, the method 
requires the entire training set to be stored, and if this set is large, it will be slow to 
search. For large training sets some form of binning is usually used to reduce the number 
of comparisons needed.¹ On the positive side, there are no restrictions on what distance 
measure to use; practically anything you can think of will work (which is not the same 
as saying that it will perform well). The algorithm is also trivially parallelizable. 

Implementing kNN in a basic form is pretty straightforward. Given a set of training 
examples and a list of associated labels, the code below does the job. The training 
examples and labels can be rows in an array or just in a list. They can be numbers, 
strings, whatever you like. Add this class to a file called knn.py. 

from numpy import array, sqrt, sum

class KnnClassifier(object):

    def __init__(self,labels,samples):
        """ Initialize classifier with training data. """

        self.labels = labels
        self.samples = samples

    def classify(self,point,k=3):
        """ Classify a point against k nearest
            in the training data, return label. """

        # compute distance to all training points
        dist = array([L2dist(point,s) for s in self.samples])

        # sort them
        ndx = dist.argsort()

        # use dictionary to store the k nearest
        votes = {}
        for i in range(k):
            label = self.labels[ndx[i]]
            votes.setdefault(label,0)
            votes[label] += 1

        # return the label with the most votes
        return max(votes, key=votes.get)


def L2dist(p1,p2):
    return sqrt( sum( (p1-p2)**2) )

¹ Another option is to only keep a selected subset of the training set. This can, however, impact accuracy. 

It is easiest to define a class and initialize with the training data. That way, we don’t 
have to store and pass the training data as arguments every time we want to classify 
something. Using a dictionary for storing the k nearest labels makes it possible to have 
labels as text strings or numbers. In this example, we used the Euclidean (L2) distance 
measure. If you have other measures, just add them as functions. 
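
As a rough sketch of the same voting idea, the function below computes all distances in one NumPy expression and counts the votes with unique(). The name knn_classify() and the toy data are made up for illustration and are not part of knn.py.

```python
import numpy as np

def knn_classify(samples, labels, point, k=3):
    """Return the majority label among the k training samples nearest to point."""
    samples = np.asarray(samples, dtype=float)
    labels = np.asarray(labels)
    # Euclidean (L2) distance from the query point to every training sample
    diff = samples - np.asarray(point, dtype=float)
    dist = np.sqrt((diff ** 2).sum(axis=1))
    # indices of the k nearest training samples
    nearest = np.argsort(dist)[:k]
    # count the votes and pick the label with the most of them
    values, counts = np.unique(labels[nearest], return_counts=True)
    return values[np.argmax(counts)]

# toy data: two well-separated groups of 2D points
train = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
train_labels = np.array([1, 1, 1, -1, -1, -1])

print(knn_classify(train, train_labels, [0.1, 0.1], k=3))
print(knn_classify(train, train_labels, [5.1, 5.0], k=3))
```

With a tie in the vote count, argmax() returns the first of the tied labels in sorted order, so an odd k is the safer choice for two-class problems.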

A Simple 2D Example 

Let's first create some simple 2D example data sets to illustrate and visualize how this 
classifier works. The following script will create two different 2D point sets, each with 
two classes, and save the data using Pickle: 

from numpy import array, hstack, ones, pi, cos, sin
from numpy.random import randn
import pickle

# create sample data of 2D points
n = 200

# two normal distributions
class_1 = 0.6 * randn(n,2)
class_2 = 1.2 * randn(n,2) + array([5,1])
labels = hstack((ones(n),-ones(n)))

# save with Pickle
with open('points_normal.pkl', 'w') as f:
    pickle.dump(class_1,f)
    pickle.dump(class_2,f)
    pickle.dump(labels,f)

# normal distribution and ring around it
class_1 = 0.6 * randn(n,2)
r = 0.8 * randn(n,1) + 5
angle = 2*pi * randn(n,1)
class_2 = hstack((r*cos(angle),r*sin(angle)))
labels = hstack((ones(n),-ones(n)))

# save with Pickle
with open('points_ring.pkl', 'w') as f:
    pickle.dump(class_1,f)
    pickle.dump(class_2,f)
    pickle.dump(labels,f)

Run the script twice with different filenames, for example points_normal_test.pkl first 
and points_ring_test.pkl the second time. You will now have four files with 2D data sets, 
two files for each of the distributions. We can use one for training and the other for 
testing. 

Let's see how to do that with the kNN classifier. Create a script with the following 
commands: 

import pickle
from numpy import array, vstack
from pylab import show
import knn
import imtools

# load 2D points using Pickle
with open('points_normal.pkl', 'r') as f:
    class_1 = pickle.load(f)
    class_2 = pickle.load(f)
    labels = pickle.load(f)

model = knn.KnnClassifier(labels,vstack((class_1,class_2)))

This will create a kNN classifier model using the data in the Pickle file. Now add the 
following: 

# load test data using Pickle
with open('points_normal_test.pkl', 'r') as f:
    class_1 = pickle.load(f)
    class_2 = pickle.load(f)
    labels = pickle.load(f)

# test on the first point
print model.classify(class_1[0])

This loads the other data set (the test set) and prints the estimated class label of the 
first point to your console. 

To visualize the classification of all the test points and show how well the classifier 
separates the two classes, we can add these lines: 

# define function for plotting
def classify(x,y,model=model):
    return array([model.classify([xx,yy]) for (xx,yy) in zip(x,y)])

# plot the classification boundary
imtools.plot_2D_boundary([-6,6,-6,6],[class_1,class_2],classify,[1,-1])
show()

Here we created a small helper function that takes arrays of 2D coordinates x and y 
and the classifier and returns an array of estimated class labels. Now we can pass this 
function as an argument to the actual plotting function. Add the following function to 
your file imtools: 

def plot_2D_boundary(plot_range,points,decisionfcn,labels,values=[0]):
    """ Plot_range is (xmin,xmax,ymin,ymax), points is a list
        of class points, decisionfcn is a function to evaluate,
        labels is a list of labels that decisionfcn returns for each class,
        values is a list of decision contours to show. """

    clist = ['b','r','g','k','m','y'] # colors for the classes

    # evaluate on a grid and plot contour of decision function
    x = arange(plot_range[0],plot_range[1],.1)
    y = arange(plot_range[2],plot_range[3],.1)
    xx,yy = meshgrid(x,y)
    xxx,yyy = xx.flatten(),yy.flatten() # lists of x,y in grid
    zz = array(decisionfcn(xxx,yyy))
    zz = zz.reshape(xx.shape)

    # plot contour(s) at values
    contour(xx,yy,zz,values)

    # for each class, plot the points with '*' for correct, 'o' for incorrect
    for i in range(len(points)):
        d = decisionfcn(points[i][:,0],points[i][:,1])
        correct_ndx = labels[i]==d
        incorrect_ndx = labels[i]!=d
        plot(points[i][correct_ndx,0],points[i][correct_ndx,1],'*',color=clist[i])
        plot(points[i][incorrect_ndx,0],points[i][incorrect_ndx,1],'o',color=clist[i])

    axis('equal')

This function takes a decision function (the classifier) and evaluates it on a grid using 
meshgrid(). The contours of the decision function can be plotted to show where the 
boundaries are. The default is the zero contour. The resulting plots look like the ones 
in Figure 8-1. As you can see, the kNN decision boundary can adapt to the distribution 
of the classes without any explicit modeling. 
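
To see what the grid evaluation produces without opening a plot window, here is a small sketch of just that step. The helper evaluate_on_grid() and the toy decision function left_right() are hypothetical names used only for this illustration; a coarse step of 0.5 keeps the grid small enough to print.

```python
import numpy as np

def evaluate_on_grid(decisionfcn, plot_range, step=0.1):
    """Evaluate decisionfcn(x, y) on a regular grid over plot_range."""
    xmin, xmax, ymin, ymax = plot_range
    x = np.arange(xmin, xmax, step)
    y = np.arange(ymin, ymax, step)
    xx, yy = np.meshgrid(x, y)
    # flatten the grid to coordinate lists, classify, and restore the grid shape
    zz = np.array(decisionfcn(xx.flatten(), yy.flatten()))
    return xx, yy, zz.reshape(xx.shape)

def left_right(x, y):
    """Toy decision function: label 1 left of the y axis, -1 elsewhere."""
    return np.where(x < 0, 1, -1)

xx, yy, zz = evaluate_on_grid(left_right, [-1, 1, -1, 1], step=0.5)
print(zz)
```

The zero contour of zz then runs between the grid columns where the label changes sign, which is the curve drawn as the decision boundary.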


Dense SIFT as Image Feature 

Let's look at classifying some images. To do so, we need a feature vector for the image. We 
saw feature vectors with average RGB and PCA coefficients as examples in the clustering 
chapter. Here we will introduce another representation, the dense SIFT feature vector. 

A dense SIFT representation is created by applying the descriptor part of SIFT to a 
regular grid across the whole image.² We can use the same executables as in Section 2.2 
and get dense SIFT features by adding some extra parameters. Create a file dsift.py as a 
placeholder for the dense SIFT computation and add the following function: 


² Another common name is Histogram of Oriented Gradients (HOG). 

Figure 8-1. Classifying 2D data using a k-nearest neighbor classifier. For each example, the color shows 
the class label. Correctly classified points are shown with stars and misclassified points with circles. 
The curve is the classifier decision boundary. 


import os
from PIL import Image
from numpy import array, meshgrid, ones, zeros, savetxt
import sift

def process_image_dsift(imagename,resultname,size=20,steps=10,
                        force_orientation=False,resize=None):
    """ Process an image with densely sampled SIFT descriptors
        and save the results in a file. Optional input: size of features,
        steps between locations, forcing computation of descriptor orientation
        (False means all are oriented upward), tuple for resizing the image."""

    im = Image.open(imagename).convert('L')
    if resize!=None:
        im = im.resize(resize)
    m,n = im.size

    if imagename[-3:] != 'pgm':
        # create a pgm file
        im.save('tmp.pgm')
        imagename = 'tmp.pgm'

    # create frames and save to temporary file
    scale = size/3.0
    x,y = meshgrid(range(steps,m,steps),range(steps,n,steps))
    xx,yy = x.flatten(),y.flatten()
    frame = array([xx,yy,scale*ones(xx.shape[0]),zeros(xx.shape[0])])
    savetxt('tmp.frame',frame.T,fmt='%03.3f')

    if force_orientation:
        cmmd = str("sift "+imagename+" --output="+resultname+
                    " --read-frames=tmp.frame --orientations")
    else:
        cmmd = str("sift "+imagename+" --output="+resultname+
                    " --read-frames=tmp.frame")
    os.system(cmmd)
    print 'processed', imagename, 'to', resultname

Figure 8-2. An example of applying dense SIFT descriptors over an image. 


Compare this to the function process_image() in Section 2.2. We use the function 
savetxt() to store the frame array in a text file for command line processing. The last 
parameter of this function can be used to resize the image before extracting the descriptors. 
For example, passing resize=(100,100) will resize to square images 100 x 100 pixels. 
Last, if force_orientation is true the descriptors will be normalized based on the local 
dominant gradient direction. If it is false, all descriptors are simply oriented upward. 

Use it like this to compute the dense SIFT descriptors and visualize the locations: 

from PIL import Image
from numpy import array
from pylab import show
import dsift, sift

dsift.process_image_dsift('empire.jpg','empire.sift',90,40,True)
l,d = sift.read_features_from_file('empire.sift')

im = array(Image.open('empire.jpg'))
sift.plot_features(im,l,True)
show()

This will compute SIFT features densely across the image with the local gradient 
orientation used to orient the descriptors (by setting force_orientation to true). The 
locations are shown in Figure 8-2. 

Classifying Images—Hand Gesture Recognition 

In this application, we will look at applying the dense SIFT descriptor to images of 
hand gestures to build a simple hand gesture recognition system. We will use some images 
from the Static Hand Posture Database (available at http://www.idiap.ch/resource/gestures/) 
to illustrate. Download the smaller test set (“test set 16.3Mb” on the web page) 
and take all the images in the “uniform” folders and split each class evenly into two 
folders called “train” and “test”. 

Figure 8-3. Dense SIFT descriptors on sample images from the six categories of hand gesture images. 
(Images from the Static Hand Posture Database.) 

Process the images with the dense SIFT function above to get feature vectors for all 
images. Again, assuming the filenames are in a list imlist, this is done like this: 

import dsift

# process images at fixed size (50,50)
for filename in imlist:
    featfile = filename[:-3]+'dsift'
    dsift.process_image_dsift(filename,featfile,10,5,resize=(50,50))

This creates feature files for each image with the extension “.dsift”. Note the resizing of 
the images to some common fixed size. This is very important; otherwise your images 
will have varying number of descriptors, and therefore varying length of the feature 
vectors. This will cause errors when comparing them later. Plotting the images with the 
descriptors looks like Figure 8-3. 
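
To make the effect of the image size concrete, the sketch below counts the grid locations that the frame construction in process_image_dsift() produces. The helper dense_grid() is a hypothetical stand-in that repeats only the meshgrid step, and the feature length assumes the standard 128-dimensional SIFT descriptor.

```python
import numpy as np

def dense_grid(width, height, steps):
    """Grid locations used for dense SIFT, one descriptor per location."""
    x, y = np.meshgrid(range(steps, width, steps), range(steps, height, steps))
    return x.flatten(), y.flatten()

# the settings used for the gesture images: 50x50 pixels, step 5
xx, yy = dense_grid(50, 50, 5)
n_frames = len(xx)
feature_length = n_frames * 128     # 128 values per SIFT descriptor (assumed)
print(n_frames)
print(feature_length)

# an image of a different size gives a feature vector of a different length
xx2, yy2 = dense_grid(60, 40, 5)
print(len(xx2) * 128)
```

For the 50 x 50 images this gives 81 grid locations and a feature vector with 10,368 entries, while the 60 x 40 example gives 77 locations and 9,856 entries, so the two vectors could not be compared directly.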

Define a helper function for reading the dense SIFT descriptor files like this: 

import os, sift
from numpy import array

def read_gesture_features_labels(path):
    # create list of all files ending in .dsift
    featlist = [os.path.join(path,f) for f in os.listdir(path) if f.endswith('.dsift')]

    # read the features
    features = []
    for featfile in featlist:
        l,d = sift.read_features_from_file(featfile)
        features.append(d.flatten())
    features = array(features)

    # create labels
    labels = [featfile.split('/')[-1][0] for featfile in featlist]

    return features,array(labels)

Then we can read the features and labels for our test and training sets using the following 
commands: 

features,labels = read_gesture_features_labels('train/')
test_features,test_labels = read_gesture_features_labels('test/')
classnames = unique(labels)

Here we used the first letter in the filename to create class labels. Using the NumPy function 
unique(), we get a sorted list of unique class names. 
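
As a small illustration of this labeling step on its own, the sketch below derives labels and class names from a handful of made-up paths. The file names are hypothetical and only mimic the pattern of a class letter at the start of the name.

```python
import os
import numpy as np

# hypothetical feature files; the first letter of each file name is the class
featlist = ['train/A-uniform01.dsift', 'train/V-uniform07.dsift',
            'train/B-uniform03.dsift', 'train/A-uniform02.dsift']

# first character of the base name, independent of the folder part
labels = np.array([os.path.basename(f)[0] for f in featlist])

# sorted array of the distinct class names
classnames = np.unique(labels)
print(labels)
print(classnames)
```

Here os.path.basename() plays the same role as splitting the path on '/' and taking the last part.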

Now we can try our nearest neighbor code on this data: 

# test kNN
k = 1
knn_classifier = knn.KnnClassifier(labels,features)
res = array([knn_classifier.classify(test_features[i],k) for i in
             range(len(test_labels))])

# accuracy
acc = sum(1.0*(res==test_labels)) / len(test_labels)
print 'Accuracy:', acc

First, the classifier object is created with the training data and labels as input. Then 
we iterate over the test set and classify each image using the classify() method. The 
accuracy is computed by multiplying the boolean array by one and summing. In this 
case, the true values are 1, so it is a simple thing to count the correct classifications. 
This gives a printout like: 

Accuracy: 0.811518324607

which means that 81% were correctly classified in this case. The value will vary with 
the choice of k and the parameters for the dense SIFT image descriptor. 

The accuracy above shows how many correct classifications there are for a given test set, 
but it does not tell us which signs are hard to classify or which mistakes are typically 
made. A confusion matrix is a matrix that shows how many samples of each class are 
classified as each of the classes. It shows how the errors are distributed and what classes 
are often “confused” for each other. 

The following function will print the labels and the corresponding confusion matrix: 

def print_confusion(res,labels,classnames):
    n = len(classnames)

    # confusion matrix: rows are estimated classes, columns are true classes
    class_ind = dict([(classnames[i],i) for i in range(n)])

    confuse = zeros((n,n))
    for i in range(len(labels)):
        confuse[class_ind[res[i]],class_ind[labels[i]]] += 1

    print 'Confusion matrix for'
    print classnames
    print confuse

The printout of running 

print_confusion(res,test_labels,classnames)

looks like this: 

Confusion matrix for
['A' 'B' 'C' 'F' 'P' 'V']
[[ 26.   0.   2.   0.   1.   1.]
 [  0.  26.   0.   1.   1.   1.]
 [  0.   0.  25.   0.   0.   1.]
 [  0.   3.   0.  37.   0.   0.]
 [  0.   1.   2.   0.  17.   1.]
 [  3.   1.   3.   0.  14.  24.]]

This shows that, for example, in this case “P” (“Point”) is often misclassified as “V”. 
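
The printed matrix also contains the overall accuracy and the fraction of correct classifications per class. The sketch below recomputes both from the numbers in the printout, using the fact that print_confusion() puts the estimated class in the rows and the true class in the columns. The variable names accuracy, per_class and hardest are chosen here for illustration.

```python
import numpy as np

classnames = ['A', 'B', 'C', 'F', 'P', 'V']

# confusion matrix copied from the printout (rows: estimated, columns: true)
confuse = np.array([[26, 0, 2, 0, 1, 1],
                    [0, 26, 0, 1, 1, 1],
                    [0, 0, 25, 0, 0, 1],
                    [0, 3, 0, 37, 0, 0],
                    [0, 1, 2, 0, 17, 1],
                    [3, 1, 3, 0, 14, 24]], dtype=float)

# correct classifications sit on the diagonal
accuracy = np.trace(confuse) / confuse.sum()
print('Accuracy: %.4f' % accuracy)

# each column sums to the number of test images of that true class
per_class = np.diag(confuse) / confuse.sum(axis=0)
for name, frac in zip(classnames, per_class):
    print('%s: %.2f' % (name, frac))

# class with the lowest fraction of correct classifications
hardest = classnames[int(np.argmin(per_class))]
print(hardest)
```

The overall value reproduces the accuracy of about 0.81 reported earlier (155 of 191 images), and “P” comes out as the hardest class, with 14 of its 33 test images labeled “V”.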

8.2 Bayes Classifier 

Another simple but powerful classifier is the Bayes classifier³ (or naive Bayes classifier). 
A Bayes classifier is a probabilistic classifier based on applying Bayes’ theorem for 
conditional probabilities. The assumption is that all features are independent and 
unrelated to each other (this is the “naive” part). Bayes classifiers can be trained very 
efficiently since the model chosen is applied to each feature independently. Despite 
their simplistic assumptions, Bayes classifiers have been very successful in practical 
applications, in particular for email spam filtering. Another benefit of this classifier is 
that once the model is learned, no training data needs to be stored. Only the model 
parameters are needed. 

The classifier is constructed by multiplying the individual conditional probabilities 
from each feature to get the total probability of a class. Then the class with highest 
probability is selected. 
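
With many features, a product of small probabilities can underflow to zero. A common workaround, sketched here as an alternative and not as the method used in this chapter, is to add log probabilities instead of multiplying probabilities. The function log_gauss() and the model parameters are illustrative assumptions; the means and variances merely resemble the 2D point sets from the previous section.

```python
import numpy as np

def log_gauss(m, v, x):
    """Sum of per-feature Gaussian log densities for the rows of x."""
    x = np.atleast_2d(x)
    return -0.5 * np.sum(np.log(2 * np.pi * v) + (x - m) ** 2 / v, axis=1)

# illustrative model: one mean and one variance vector per class
means = [np.array([0.0, 0.0]), np.array([5.0, 1.0])]
variances = [np.array([0.36, 0.36]), np.array([1.44, 1.44])]
class_labels = np.array([1, -1])

points = np.array([[0.2, -0.1], [4.5, 1.5]])

# one row of log probabilities per class, one column per point
log_prob = np.array([log_gauss(m, v, points) for m, v in zip(means, variances)])

# the class with the highest (log) probability wins
est_labels = class_labels[log_prob.argmax(axis=0)]
print(est_labels)
```

Since the logarithm is monotonic, the selected class is the same as when the probabilities themselves are multiplied and compared.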


³ After Thomas Bayes, an 18th-century English mathematician and minister. 

Let's look at a basic implementation of a Bayes classifier using Gaussian probability 
distribution models. This means that each feature is individually modeled using the 
feature mean and variance, computed from a set of training data. Add the following 
classifier class to a file called bayes.py. 

from numpy import array, mean, var, diag, dot, exp, pi, prod, sqrt

class BayesClassifier(object):

    def __init__(self):
        """ Initialize classifier with training data. """

        self.labels = []    # class labels
        self.mean = []      # class mean
        self.var = []       # class variances
        self.n = 0          # nbr of classes

    def train(self,data,labels=None):
        """ Train on data (list of arrays n*dim).
            Labels are optional, default is 0...n-1. """

        # "is None" also works when labels is an array
        if labels is None:
            labels = range(len(data))
        self.labels = labels
        self.n = len(labels)

        for c in data:
            self.mean.append(mean(c,axis=0))
            self.var.append(var(c,axis=0))

    def classify(self,points):
        """ Classify the points by computing probabilities
            for each class and return most probable label. """

        # compute probabilities for each class
        est_prob = array([gauss(m,v,points) for m,v in zip(self.mean,self.var)])

        # get index of highest probability, this gives class label
        ndx = est_prob.argmax(axis=0)
        est_labels = array([self.labels[n] for n in ndx])

        return est_labels, est_prob

The model has two variables per class, the class mean and variance. The train() 
method takes a list of feature arrays (one per class) and computes the mean and 
variance for each. The method classify() computes the class probabilities for an 
array of data points and selects the class with highest probability. The estimated class 
labels and probabilities are returned. The helper function for the actual Gaussian 
function is also needed: 

def gauss(m,v,x):
    """ Evaluate Gaussian in d-dimensions with independent
        mean m and variance v at the points in (the rows of) x. """

    if len(x.shape)==1:
        n,d = 1,x.shape[0]
    else:
        n,d = x.shape

    # covariance matrix, subtract mean
    S = diag(1/v)
    x = x-m

    # product of probabilities
    y = exp(-0.5*diag(dot(x,dot(S,x.T))))

    # normalize and return
    return y * (2*pi)**(-d/2.0) / ( sqrt(prod(v)) + 1e-6)

This function computes the product of the individual Gaussian distributions and 
returns the probability for a given pair of model parameters m,v. For more details 
on this function, see for example 
http://en.wikipedia.org/wiki/Multivariate_normal_distribution. 
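
To connect the code with the formula, the product of d independent one-dimensional Gaussians can be rewritten as a single multivariate Gaussian with a diagonal covariance matrix. The notation below is introduced here for illustration: m_j and v_j are the entries of m and v, and S is the matrix diag(1/v) from the code. The small constant 1e-6 that the code adds to the denominator is not part of the formula; it only guards against division by zero.

```latex
\begin{aligned}
p(\mathbf{x})
  &= \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi v_j}}
     \exp\!\left(-\frac{(x_j - m_j)^2}{2 v_j}\right) \\
  &= (2\pi)^{-d/2} \Big(\prod_{j=1}^{d} v_j\Big)^{-1/2}
     \exp\!\left(-\tfrac{1}{2}\,(\mathbf{x}-\mathbf{m})^{T} S\,(\mathbf{x}-\mathbf{m})\right),
  \qquad S = \operatorname{diag}(1/v_1, \ldots, 1/v_d).
\end{aligned}
```

The exponent corresponds to the line computing y, and the two leading factors correspond to the normalization in the return statement.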

Try this Bayes classifier on the 2D data from the previous section. This script will load 
the exact same point sets and train a classifier: 

import pickle
from numpy import vstack
from pylab import show
import bayes
import imtools

# load 2D example points using Pickle
with open('points_normal.pkl', 'r') as f:
    class_1 = pickle.load(f)
    class_2 = pickle.load(f)
    labels = pickle.load(f)

# train Bayes classifier
bc = bayes.BayesClassifier()
bc.train([class_1,class_2],[1,-1])

Now, we can load the other one and test the classifier: 

# load test data using Pickle
with open('points_normal_test.pkl', 'r') as f:
    class_1 = pickle.load(f)
    class_2 = pickle.load(f)
    labels = pickle.load(f)

# test on some points
print bc.classify(class_1[:10])[0]

# plot points and decision boundary
def classify(x,y,bc=bc):
    points = vstack((x,y))
    return bc.classify(points.T)[0]

imtools.plot_2D_boundary([-6,6,-6,6],[class_1,class_2],classify,[1,-1])
show()

This prints the classification result for the first 10 points to the console. It might look 
like this: 

[1 1 1 1 1 1 1 1 1 1]

Again, we used a helper function classify() to pass to the plotting function for visualizing 
the classification results by evaluating the function on a grid. The plots for the 
two sets look like Figure 8-4. The decision boundary, in this case, will be the ellipse-like 
level curves of a 2D Gaussian function. 

Figure 8-4. Classifying 2D data using a Bayes classifier. For each example, the color shows the class 
label. Correctly classified points are shown with stars and misclassified points with circles. The curve 
is the classifier decision boundary. 

Using PCA to Reduce Dimensions 

Now, let's try the gesture recognition problem. Since the feature vectors are very large for 
the dense SIFT descriptor (more than 10,000 for the parameter choices in the example 
above), it is a good idea to do dimensionality reduction before fitting models to the 
data. Principal Component Analysis, PCA, (see Section 1.3) usually does a good job. 
Try the following script that uses PCA from the file pca.py (page 13): 

import pca 

V,S,m = pca.pca(features) 

# keep most important dimensions 
V = V[:50] 

features = array([dot(V,f-m) for f in features]) 
test_features = array([dot(V,f-m) for f in test_features]) 

Here features and test_features are the same arrays that we loaded for the kNN example. 
In this case, we apply PCA on the training data and keep the 50 dimensions with most 
variance. This is done by subtracting the mean m (computed on the training data) and 
multiplying with the basis vectors V. The same transformation is applied to the test 
data. 
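
The important detail is that the mean and the basis come from the training data only and are then reused for the test data. The sketch below shows this on random data; fit_pca() is a hypothetical stand-in based on NumPy's SVD and is not the pca.pca() function from Section 1.3, and the array sizes are arbitrary.

```python
import numpy as np

def fit_pca(X, n_components):
    """Return the top principal directions (as rows) and the mean of X."""
    m = X.mean(axis=0)
    # rows of Vt are the principal directions, ordered by decreasing variance
    U, s, Vt = np.linalg.svd(X - m, full_matrices=False)
    return Vt[:n_components], m

rng = np.random.RandomState(0)
train = rng.randn(40, 10)       # 40 training vectors with 10 dimensions
test = rng.randn(15, 10)        # 15 test vectors with the same dimensions

V, m = fit_pca(train, 3)

# project both sets with the mean and basis estimated on the training set
train_proj = np.dot(train - m, V.T)
test_proj = np.dot(test - m, V.T)
print(train_proj.shape)
print(test_proj.shape)
```

Estimating a separate mean or basis on the test set would put the two sets in different coordinate systems, and a model trained on one would not apply to the other.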

Train and test the Bayes classifier like this: 

# test Bayes
bc = bayes.BayesClassifier()
blist = [features[where(labels==c)[0]] for c in classnames]

bc.train(blist,classnames)
res = bc.classify(test_features)[0]

Since BayesClassifier takes a list of arrays (one array for each class), we transform the 
data before passing it to the train() method. Since we don’t need the probabilities for 
now, we chose to return only the labels of the classification. 

Checking the accuracy 

acc = sum(1.0*(res==test_labels)) / len(test_labels)
print 'Accuracy:', acc 

gives something like this: 

Accuracy: 0.717277486911 

Checking the confusion matrix 

print_confusion(res,test_labels,classnames) 

gives a printout like this: 

Confusion matrix for
['A' 'B' 'C' 'F' 'P' 'V']
[[ 20.   0.   0.   4.   0.   0.]
 [  0.  26.   1.   7.   2.   2.]
 [  1.   0.  27.   5.   1.   0.]
 [  0.   2.   0.  17.   0.   0.]
 [  0.   1.   0.   4.  22.   1.]
 [  8.   2.   4.   1.   8.  25.]]


This is not as good as the kNN classifier, but with the Bayes classifier we don't need 
to keep any training data, just the model parameters for each of the classes. The result 
will vary greatly with the choice of dimensions after PCA. 

8.3 Support Vector Machines 

Support Vector Machines (SVM) are a powerful type of classifier that often gives 
state-of-the-art results for many classification problems. In its simplest form, an SVM finds 
a linear separating hyperplane (a plane in higher-dimensional spaces) with the best 
possible separation between two classes. The decision function for a feature vector $\mathbf{x}$ is 

$$f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} - b,$$

where $\mathbf{w}$ is the hyperplane normal and $b$ an offset constant. The zero level of this 
function then ideally separates the two classes so that one class has positive values 
and the other negative. The parameters $\mathbf{w}$ and $b$ are found by solving an optimization 
problem on a training set of labeled feature vectors $\mathbf{x}_i$ with labels $y_i \in \{-1, 1\}$ so that 
the hyperplane has maximal separation between the two classes. The normal is a linear 
combination of some of the training feature vectors 

$$\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i,$$

so that the decision function can be written 

$$f(\mathbf{x}) = \sum_i \alpha_i y_i \, \mathbf{x}_i \cdot \mathbf{x} - b.$$

Here $i$ runs over a selection of the training vectors; the selected vectors $\mathbf{x}_i$ are called 
support vectors since they help define the classification boundary. 
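
As a numerical check of the two forms of the decision function, the sketch below evaluates both for a tiny made-up model. The support vectors, labels, coefficients alpha and offset b are invented for illustration and do not come from a trained SVM.

```python
import numpy as np

# hypothetical support vectors x_i, labels y_i, coefficients alpha_i and offset b
sv = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
alpha = np.array([0.5, 0.25, 0.75])
b = 0.1

# hyperplane normal as a linear combination of the support vectors
w = np.sum((alpha * y)[:, None] * sv, axis=0)

def decision(x):
    """f(x) = sum_i alpha_i y_i (x_i . x) - b"""
    return np.sum(alpha * y * np.dot(sv, x)) - b

x = np.array([0.5, 2.0])
f_sum = decision(x)             # sum over the support vectors
f_plane = np.dot(w, x) - b      # hyperplane form w . x - b
print(f_sum)
print(f_plane)
```

Both expressions give the same value up to rounding, and its sign decides which side of the hyperplane the point x falls on.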

One of the strengths of SVM is that by using kernel functions, that is, functions that map 
the feature vectors to a different (higher) dimensional space, non-linear or very difficult 
classification problems can be effectively solved while still keeping some control over 
the decision function. Kernel functions replace the inner product of the classification 
function, $\mathbf{x}_i \cdot \mathbf{x}$, with a function $K(\mathbf{x}_i, \mathbf{x})$. 

Some of the most common kernel functions are: 

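• linear, a hyperplane in feature space, the simplest case, 
$K(\mathbf{x}_i, \mathbf{x}) = \mathbf{x}_i \cdot \mathbf{x}$ 

• polynomial, features are mapped with polynomials of a defined degree $d$, 
$K(\mathbf{x}_i, \mathbf{x}) = (\gamma \, \mathbf{x}_i \cdot \mathbf{x} + r)^d$, $\gamma > 0$ 

• radial basis functions, exponential functions, usually a very effective choice, 
$K(\mathbf{x}_i, \mathbf{x}) = e^{-\gamma \|\mathbf{x}_i - \mathbf{x}\|^2}$, $\gamma > 0$ 

• sigmoid, a smoother alternative to hyperplane, 
$K(\mathbf{x}_i, \mathbf{x}) = \tanh(\gamma \, \mathbf{x}_i \cdot \mathbf{x} + r)$ 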

The parameters of each kernel are also determined during training. 
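
The four kernels are short enough to write out directly. The following sketch implements them with NumPy for a single pair of vectors; the function names and the default values for gamma, r and d are chosen here for illustration only.

```python
import numpy as np

def linear_kernel(xi, x):
    return np.dot(xi, x)

def polynomial_kernel(xi, x, gamma=1.0, r=0.0, d=3):
    return (gamma * np.dot(xi, x) + r) ** d

def rbf_kernel(xi, x, gamma=1.0):
    # exp(-gamma * squared Euclidean distance)
    return np.exp(-gamma * np.sum((xi - x) ** 2))

def sigmoid_kernel(xi, x, gamma=1.0, r=0.0):
    return np.tanh(gamma * np.dot(xi, x) + r)

xi = np.array([1.0, 2.0])
x = np.array([2.0, 0.0])

print(linear_kernel(xi, x))
print(polynomial_kernel(xi, x, gamma=1.0, r=1.0, d=2))
print(rbf_kernel(xi, x, gamma=0.5))
print(sigmoid_kernel(xi, x))
```

Note that the radial basis function depends only on the distance between the two vectors, so it equals 1 when they coincide and falls toward 0 as they move apart.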

For multi-class problems, the usual procedure is to train multiple SVMs so that each 
separates one class from the rest (also known as “one-versus-all” classifiers). For more 
details on SVMs, see for example the book [9] and the online references at 
http://www.support-vector.net/references.html. 
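
To illustrate how the outputs of such one-versus-all classifiers can be combined, the sketch below assumes that each binary classifier returns a decision value and assigns the class whose classifier gives the largest value. The function one_versus_all_predict() and the scores are made up for illustration, and this is only one possible way to combine the outputs.

```python
import numpy as np

def one_versus_all_predict(scores, classnames):
    """scores[i, j] is the decision value of the 'class j versus rest'
    classifier for sample i; the largest value decides the class."""
    return np.asarray(classnames)[np.argmax(scores, axis=1)]

classnames = ['A', 'B', 'C']

# hypothetical decision values for three samples
scores = np.array([[ 1.2, -0.3, -2.0],
                   [-0.8, -0.1, -0.5],
                   [-1.5, -0.7,  0.4]])

predicted = one_versus_all_predict(scores, classnames)
print(predicted)
```

The second sample shows why the values are compared and not only their signs: every classifier votes “rest”, and the least negative value still selects a class.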

Using LibSVM 

We will use one of the best and most commonly used implementations available, 
LibSVM [7]. LibSVM comes with a nice Python interface (there are also interfaces for 
many other programming languages). For installation instructions, see Section A.4. 

Let's use LibSVM on the sample 2D point data to see how it works. This script will load 
the same points and train an SVM classifier using radial basis functions: 

import pickle
from svmutil import *
import imtools

# load 2D example points using Pickle
with open('points_normal.pkl', 'r') as f:
    class_1 = pickle.load(f)
    class_2 = pickle.load(f)
    labels = pickle.load(f)

# convert to lists for libsvm
class_1 = map(list,class_1)
class_2 = map(list,class_2)
labels = list(labels)
samples = class_1+class_2 # concatenate the two lists

# create SVM
prob = svm_problem(labels,samples)
param = svm_parameter('-t 2')

# train SVM on data
m = svm_train(prob,param)

# how did the training do?
res = svm_predict(labels,samples,m)

Loading the data set is the same as before, but this time we have to convert the arrays
to lists since LibSVM does not support array objects as input. Here we used Python's
built-in map() function that applies the conversion function list() to each element. The
next lines create an SVM problem object and set some parameters. The svm_train() call
solves the optimization problem for determining the model parameters. The model
can then be used for predictions. The last call to svm_predict() classifies the training
data with the model m and shows how successful the training was. The printout looks
something like this:

Accuracy = 100% ( 400 / 400 ) (classification) 

This means that the classifier completely separates the training data and correctly
classifies all 400 data points.

Note that we added a string of parameter choices in the call to train the classifier. These
parameters are used to control the kernel type, degree, and other choices for the classifier.
Most of them are outside the scope of this book but the important ones to know
are “t” and “d”. The parameter “t” determines the type of kernel used. The options are:

"-t"   kernel
0      linear
1      polynomial
2      radial basis function (default)
3      sigmoid

The parameter “d” determines the degree of the polynomial (default is 3).
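
Since the options are passed as a single string, it can be convenient to build that string from named choices. The following is a small sketch of such a helper; make_svm_options() and the KERNEL_TYPES table are hypothetical names of our own and not part of LibSVM, and only the two flags discussed here ("-t" for the kernel type and "-d" for the polynomial degree) are handled.

```python
# kernel names mapped to the integer codes used by the "-t" option
KERNEL_TYPES = {'linear': 0, 'polynomial': 1, 'rbf': 2, 'sigmoid': 3}

def make_svm_options(kernel='rbf', degree=None):
    """Build an option string such as '-t 1 -d 2' (hypothetical helper)."""
    options = '-t %d' % KERNEL_TYPES[kernel]
    if degree is not None:
        # the degree is only meaningful for the polynomial kernel
        options += ' -d %d' % degree
    return options

print(make_svm_options('polynomial', degree=2))
```

The resulting string could then be passed to svm_parameter() in the same way as the literal '-t 2' used above.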

Now, load the other point set and test the classifier: 

# load test data using Pickle
with open('points_normal_test.pkl', 'r') as f:
    class_1 = pickle.load(f)
    class_2 = pickle.load(f)
    labels = pickle.load(f)

# convert to lists for libsvm
class_1 = map(list,class_1)
class_2 = map(list,class_2)

# define function for plotting
def predict(x,y,model=m):
    return array(svm_predict([0]*len(x),zip(x,y),model)[0])

# plot the classification boundary
imtools.plot_2D_boundary([-6,6,-6,6],[array(class_1),array(class_2)],predict,[-1,1])
show()

Again we have to convert the data to lists for LibSVM. As before, we also define a helper
function predict() for plotting the classification boundary. Note the use of a list of zeros
[0]*len(x) as a replacement for the label list if true labels are unavailable. You can use
any list as long as it has the correct length. The 2D plots for the two different point
data sets are shown in Figure 8-5.

Figure 8-5. Classifying 2D data using a Support Vector Machine classifier. For each example, the color
shows the class label. Correctly classified points are shown with stars and misclassified points with
circles. The curve is the classifier decision boundary.


Hand Gesture Recognition Again 

Using LibSVM on our multi-class hand gesture recognition problem is fairly straightforward.
Multiple classes are automatically handled, so we only need to format the data
so that the input and output match the requirements of LibSVM.

With training and testing data in arrays named features and test_features as in the 
previous examples, the following will load the data and train a linear SVM classifier: 

features = map(list,features)
test_features = map(list,test_features)

# create conversion function for the labels
transl = {}
for i,c in enumerate(classnames):
    transl[c],transl[i] = i,c

# create SVM
prob = svm_problem(convert_labels(labels,transl),features)
param = svm_parameter('-t 0')

# train SVM on data
m = svm_train(prob,param)

# how did the training do?
res = svm_predict(convert_labels(labels,transl),features,m)

# test the SVM
res = svm_predict(convert_labels(test_labels,transl),test_features,m)[0]
res = convert_labels(res,transl)
Same as before, we convert the data to lists using a map() call. Then the labels need to
be converted since LibSVM does not handle string labels. The dictionary transl will
contain a conversion between string and integer labels. Try to print it to your console
to see what happens. The parameter “-t 0” makes it a linear classifier and the decision
boundary will be a hyperplane in the original feature space of some 10,000 dimensions.
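
To see how one dictionary can translate in both directions, here is a small self-contained sketch. The class names are the six gesture classes used in this chapter; the body of convert_labels() shown here is an assumed minimal version, written only to illustrate the idea.

```python
classnames = ['A', 'B', 'C', 'F', 'P', 'V']

# one dictionary holds both directions: name -> index and index -> name
transl = {}
for i, c in enumerate(classnames):
    transl[c], transl[i] = i, c

def convert_labels(labels, transl):
    """Translate each label through the dictionary (assumed minimal version)."""
    return [transl[l] for l in labels]

numeric = convert_labels(['B', 'V', 'A'], transl)
print(numeric)                          # integer labels for LibSVM
print(convert_labels(numeric, transl))  # and back to strings
```

This works because the string keys and the integer keys never collide, so the same lookup serves for both conversions.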

Now compare the labels, just like before:

acc = sum(1.0*(res==test_labels)) / len(test_labels)
print 'Accuracy:', acc

print_confusion(res,test_labels,classnames)

The output using this linear kernel should look like this: 

Accuracy: 0.916230366492 
Confusion matrix for
['A' 'B' 'C' 'F' 'P' 'V']
[[ 26.   0.   1.   0.   2.   0.]
 [  0.  28.   0.   0.   1.   0.]
 [  0.   0.  29.   0.   0.   0.]
 [  0.   2.   0.  38.   0.   0.]
 [  0.   1.   0.   0.  27.   1.]
 [  3.   0.   2.   0.   3.  27.]]
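
As a quick consistency check, the accuracy can be recovered from the confusion matrix itself: the correctly classified samples are on the diagonal and the total is the sum of all entries. A minimal sketch, with the matrix above typed in by hand:

```python
confusion = [
    [26, 0, 1, 0, 2, 0],
    [0, 28, 0, 0, 1, 0],
    [0, 0, 29, 0, 0, 0],
    [0, 2, 0, 38, 0, 0],
    [0, 1, 0, 0, 27, 1],
    [3, 0, 2, 0, 3, 27],
]

# diagonal entries are the correctly classified samples
correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
accuracy = correct / float(total)

print(correct, total, accuracy)
```

This gives 175 correct out of 191 test samples, which agrees with the accuracy printed above.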


Now if we apply PCA to reduce the dimensions to 50, as we did in Section 8.2, this 
changes the accuracy to: 

Accuracy: 0.890052356021 

Not bad, seeing that the feature vectors are about 200 times smaller than the original
data (and the space needed to store the support vectors is then also about 200 times smaller).

8.4 Optical Character Recognition 

As an example of a multi-class problem, let's look at interpreting images of Sudokus.
Optical character recognition (OCR) is the process of interpreting images of hand- or
machine-written text. A common example is text extraction from scanned documents
such as zip-codes on letters or book pages such as the library volumes in Google Books
(http://books.google.com/). Here we will look at a simple OCR problem of recognizing
numbers in images of printed Sudokus. Sudokus are a form of logic puzzles where the
goal is to fill a 9 x 9 grid with the numbers 1 ... 9 so that each column, each row, and
each 3 x 3 sub-grid contains all nine digits (see http://en.wikipedia.org/wiki/Sudoku for
more details if you are unfamiliar with the concept). In this example, we are just interested in
reading the puzzle and interpreting it correctly. Actually solving the puzzle, we leave
to you.

Training a Classifier

For this classification problem we have ten classes, the numbers 1 ... 9, and the empty
cells. Let's give the empty cells the label 0 so that our class labels are 0 ... 9. To train
this ten-class classifier, we will use a dataset of images of cropped Sudoku cells. In the
file sudoku_images.zip are two folders, “ocr_data” and “sudokus”. The latter contains
images of Sudokus under varying conditions. We will save those for later. For now,
take a look at the folder “ocr_data”. It contains two subfolders with images, one for
training and one for testing. The images are named with the first character equal to the
class (0 ... 9). Figure 8-6 shows some samples from the training set. The images are
grayscale and roughly 80 x 80 pixels (with some variation).

Figure 8-6. Sample training images for the 10 classes of the Sudoku OCR classifier.

Selecting Features 

We need to decide on what feature vector to use for representing each cell image. There
are many good choices; here we’ll try something simple but still effective. The following
function takes an image and returns a feature vector of the flattened grayscale values:

def compute_feature(im):
    """ Returns a feature vector for an
    ocr image patch. """

    # resize and remove border
    norm_im = imresize(im,(30,30))
    norm_im = norm_im[3:-3,3:-3]

    return norm_im.flatten()

This function uses the resizing function imresize() from imtools to reduce the length 
of the feature vector. We also crop away about 10% border pixels since the crops often 
get parts of the grid lines on the edges, as you can see in Figure 8-6. 
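
To see what the cropping does to the size of the feature vector, here is a small sketch that mimics the slicing and flattening on a nested list instead of an array. The helper name crop_and_flatten() is our own and is used only for this illustration.

```python
def crop_and_flatten(patch, border=3):
    """Drop 'border' rows and columns on every side, then flatten row by row."""
    rows = patch[border:-border]
    return [value for row in rows for value in row[border:-border]]

# a 30 x 30 patch where each entry encodes its own position
patch = [[r * 30 + c for c in range(30)] for r in range(30)]
feature = crop_and_flatten(patch)

print(len(feature))
```

A 30 x 30 patch cropped by 3 pixels on each side leaves 24 x 24 values, so each cell is described by a feature vector of length 576.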

Now we can read the training data using a function like this: 

import os

def load_ocr_data(path):
    """ Return labels and ocr features for all images
    in path. """

    # create list of all files ending in .jpg
    imlist = [os.path.join(path,f) for f in os.listdir(path) if f.endswith('.jpg')]

    # create labels
    labels = [int(imfile.split('/')[-1][0]) for imfile in imlist]

    # create features from the images
    features = []
    for imname in imlist:
        im = array(Image.open(imname).convert('L'))
        features.append(compute_feature(im))

    return array(features),labels

(The images are courtesy of Martin Byrod [4], http://www.maths.lth.se/matematiklth/personal/byrod/, collected and
cropped from photos of actual Sudokus.)

The labels are extracted as the first character of the filename of each of the JPEG files and
stored in the labels list as integers. The feature vectors are computed using the function
above and stored in an array.
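
The label extraction relies on the path separator being a forward slash. As a hedged alternative, the sketch below uses os.path.basename() from the standard library to get the filename; the function name label_from_filename() and the example paths are made up for illustration.

```python
import os

def label_from_filename(imfile):
    """Return the class label encoded as the first character of the filename."""
    return int(os.path.basename(imfile)[0])

# hypothetical file names following the "first character is the class" rule
print(label_from_filename('training/7_0012.jpg'))
print(label_from_filename('0_blank.jpg'))
```

Either approach gives the same labels as long as the naming convention of the dataset is followed.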


Multi-Class SVM

With the training data in place, we are ready to learn a classifier. Here we’ll use a
multi-class support vector machine. The code looks just as it does in the previous section:

from svmutil import *

# TRAINING DATA
features,labels = load_ocr_data('training/')

# TESTING DATA
test_features,test_labels = load_ocr_data('testing/')

# train a linear SVM classifier
features = map(list,features)
test_features = map(list,test_features)

prob = svm_problem(labels,features)
param = svm_parameter('-t 0')

m = svm_train(prob,param)

# how did the training do?
res = svm_predict(labels,features,m)

# how does it perform on the test set?
res = svm_predict(test_labels,test_features,m)

This trains a linear SVM classifier and tests the performance on the unseen images in 
the test set. You should get the following printout from the last two svm_predict() calls: 

Accuracy = 100% ( 1409 / 1409 ) (classification) 

Accuracy = 99.2979% (990/997) (classification) 

Great news. The 1,409 images of the training set are perfectly separated in the ten classes 
and the recognition performance on the test set is around 99%. We can now use this 
classifier on crops from new Sudoku images. 
Extracting Cells and Recognizing Characters

With a classifier that recognizes cell contents, the next step is to automatically find the
cells. Once we solve that, we can crop them and pass the crops to the classifier. Let's for
now assume that the image of the Sudoku is aligned so that the horizontal and vertical
lines of the grid are parallel to the image sides (like the left image of Figure 8-7). Under
these conditions, we can threshold the image and sum up the pixel values horizontally
and vertically. Since the edges will have values of one and the other parts values of zero,
this should give a strong response at the edges and tell us where to crop.

The following function takes a grayscale image and a direction and returns the ten 
edges for that direction: 

from scipy.ndimage import measurements

def find_sudoku_edges(im,axis=0):
    """ Finds the cell edges for an aligned sudoku image. """

    # threshold and sum rows and columns
    trim = 1*(im<128)
    s = trim.sum(axis=axis)

    # find center of strongest lines
    s_labels,s_nbr = measurements.label(s>(0.5*max(s)))
    m = measurements.center_of_mass(s,s_labels,range(1,s_nbr+1))
    x = [int(x[0]) for x in m]

    # if only the strong lines are detected add lines in between
    if len(x)==4:
        dx = diff(x)
        x = [x[0],x[0]+dx[0]/3,x[0]+2*dx[0]/3,
            x[1],x[1]+dx[1]/3,x[1]+2*dx[1]/3,
            x[2],x[2]+dx[2]/3,x[2]+2*dx[2]/3,x[3]]

    if len(x)==10:
        return x
    else:
        raise RuntimeError('Edges not detected.')

First, the image is thresholded at the midpoint to give ones on the dark areas. Then
these are added up in the specified direction (axis=0 or 1). The scipy.ndimage package
contains a module, measurements, that is very useful for counting and measuring regions
in binary or label arrays. First, label() finds the connected components of a binary
array computed by thresholding the sum at the midpoint. Then the center_of_mass()
function computes the center point of each independent component. Depending on
the graphic design of the Sudoku (all lines equally strong or the sub-grid lines stronger
than the others), you might get four or ten points. In the case of four, the intermediary
lines are interpolated at even intervals. If the end result does not have ten lines, an
exception is raised.
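
The interpolation step can be isolated in a few lines. The sketch below takes the positions of the four strong lines and places two evenly spaced lines in each of the three gaps; interpolate_edges() is a hypothetical helper written for this illustration, and it rounds to whole pixels, which is our own choice.

```python
def interpolate_edges(strong):
    """Expand four strong grid lines into the ten cell edges."""
    if len(strong) != 4:
        raise ValueError('expected four strong line positions')
    edges = []
    # each pair of neighboring strong lines spans three cells
    for start, end in zip(strong[:-1], strong[1:]):
        step = (end - start) / 3.0
        edges.extend([start, start + step, start + 2 * step])
    edges.append(strong[-1])
    return [int(round(e)) for e in edges]

print(interpolate_edges([0, 300, 600, 900]))
```

With strong lines at 0, 300, 600, and 900, the ten edges end up 100 pixels apart, which gives nine cells in that direction.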

In the “sudokus” folder is a collection of Sudoku images of varying difficulty. There is
also a file for each image containing the true values of the Sudoku so that we can check
our results. Some of the images are aligned with the image sides. Picking one of them,
you can check the performance of the cropping and classification like this:

imname = 'sudokus/sudoku18.jpg'
vername = 'sudokus/sudoku18.sud'
im = array(Image.open(imname).convert('L'))

# find the cell edges
x = find_sudoku_edges(im,axis=0)
y = find_sudoku_edges(im,axis=1)

# crop cells and classify
crops = []
for col in range(9):
    for row in range(9):
        crop = im[y[col]:y[col+1],x[row]:x[row+1]]
        crops.append(compute_feature(crop))

res = svm_predict(loadtxt(vername),map(list,crops),m)[0]
res_im = array(res).reshape(9,9)

print 'Result:'
print res_im

The edges are found and then crops are extracted for each cell. The crops are passed
to the same feature extraction function used for the training and stored in an array.
These feature vectors are classified using svm_predict() with the true labels read using
loadtxt(). The result in your console should be:


Accuracy = 100% ( 81 / 81 ) (classification) 
Now, this was one of the easier images. Try some of the others and see what the errors
look like and where the classifier makes mistakes.

If you plot the crops using a 9 x 9 subplot, they should look like the right image of 
Figure 8-7. 

Rectifying Images 

If you are happy with the performance of your classifier, the next challenge is to apply it
to non-aligned images. We will end our Sudoku example with a simple way of rectifying
an image given that the four outer corner points of the grid have been detected or marked
manually. The left image in Figure 8-8 shows an example of a Sudoku image with strong
perspective effects.
Figure 8-7. An example of detecting and cropping the fields of a Sudoku grid: image of a Sudoku grid
(left); the 9 x 9 cropped images of the individual cells to be sent to the OCR classifier (right).


A homography can map the grid to align the edges as in the examples above; all we
need to do is estimate the transform. The example below shows the case of manually
marking the four corner points and then warping the image to a square target image of
1000 x 1000 pixels:

from scipy import ndimage
import homography

imname = 'sudoku8.jpg'
im = array(Image.open(imname).convert('L'))

# mark corners
figure()
imshow(im)
gray()
x = ginput(4)

# top left, top right, bottom right, bottom left
fp = array([array([p[1],p[0],1]) for p in x]).T
tp = array([[0,0,1],[0,1000,1],[1000,1000,1],[1000,0,1]]).T

# estimate the homography
H = homography.H_from_points(tp,fp)

# helper function for geometric_transform
def warpfcn(x):
    x = array([x[0],x[1],1])
    xt = dot(H,x)
    xt = xt/xt[2]
    return xt[0],xt[1]

# warp image with full perspective transform
im_g = ndimage.geometric_transform(im,warpfcn,(1000,1000))
Figure 8-8. An example of rectifying an image using a full perspective transform: the original image with
the four corners of the Sudoku marked (left); rectified image warped to a square image of 1000 x 1000
pixels (right).


In most of these sample images an affine transform, as we used in Chapter 3, is
not enough. Here we instead used the more general transform function
geometric_transform() from scipy.ndimage. This function takes a 2D to 2D mapping
instead of a transform matrix, so we need to use a helper function (using a piecewise
affine warp on triangles will introduce artifacts in this case). The warped image is
shown to the right in Figure 8-8.
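
The core of the helper function is applying a 3 x 3 homography to one point and normalizing by the last homogeneous coordinate. Here is a minimal sketch of just that step in plain Python; apply_homography() and the example matrix are our own, chosen only so the arithmetic is easy to follow. For each pixel of the output image, geometric_transform() calls the mapping with the output coordinates and expects the corresponding input coordinates in return, which is why the homography is estimated from the target points to the marked points.

```python
def apply_homography(H, x, y):
    """Map the point (x, y) with the 3 x 3 matrix H given as nested lists."""
    # multiply H with the homogeneous point (x, y, 1)
    xh = [H[r][0] * x + H[r][1] * y + H[r][2] for r in range(3)]
    # divide by the last coordinate to return to 2D
    return xh[0] / xh[2], xh[1] / xh[2]

# example matrix: scale by 2, shift by 1, with a homogeneous scale of 2
H = [[2.0, 0.0, 1.0],
     [0.0, 2.0, 1.0],
     [0.0, 0.0, 2.0]]

print(apply_homography(H, 4.0, 6.0))
```

For a true perspective transform the last row of H is not of the form (0, 0, c), and the division then varies from point to point, which is exactly what an affine transform cannot express.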

This concludes our Sudoku OCR example. There are many improvements to be made 
and alternatives to investigate. Some are mentioned in the following exercises; the rest 
we leave to you. 


Exercises 

1. The performance of the kNN classifier depends on the value of k. Try to vary this 
number and see how the accuracy changes. Plot the decision boundaries of the 2D 
point sets to see how they change. 

2. The hand gesture data set in Figure 8-3 also contains images with more complex 
background (in the “complex/” folders). Try to train and test a classifier on these 
images. What is the difference in performance? Can you suggest improvements to 
the image descriptor? 

3. Try to vary the number of dimensions after PCA projection of the gesture recognition
features for the Bayes classifier. What is a good choice? Plot the singular values
S; they should give a typical “knee” shaped curve like the one shown in Figure 8-9.
A good compromise between ability to generate the variability of the training data
and keeping the number of dimensions low is usually found at a number before
the curve flattens out.

Figure 8-9. Graph for Exercise 3.

4. Modify the Bayes classifier to use a different probability model than Gaussian 
distributions. For example, try using the frequency counts of each feature in the 
training set. Compare the results to using a Gaussian distribution for some different 
datasets. 

5. Experiment with using non-linear SVMs for the gesture recognition problem. Try
polynomial kernels and increase the degree (using the “-d” parameter) incrementally.
What happens to the classification performance on the training set and the
test set? With a non-linear classifier, there is a risk of training and optimizing it for
a specific set so that performance is close to perfect on the training set, but the
classifier has poor performance on other test sets. This phenomenon of breaking the
generalization capabilities of a classifier is called overfitting and should be avoided.

6. Try some more advanced feature vectors for the Sudoku character recognition 
problem. If you need inspiration, look at [4]. 

7. Implement a method for automatically aligning the Sudoku grid. Try, for example,
feature detection with RANSAC, line detection, or detecting the cells using
morphological and measurement operations from scipy.ndimage
(http://docs.scipy.org/doc/scipy/reference/ndimage.html). Bonus task: Solve the rotation ambiguity of
finding the “up” direction. For example, you could try rotating the rectified grid
and let the OCR classifier's accuracy vote for the best orientation.

8. For a more challenging classification problem than the Sudoku digits, take a look 
at the MNIST database of handwritten digits http://yann.lecun.com/exdb/mnist/. Try 
to extract some features and apply SVM to that set. Check where your performance 
ends up on the ranking of best methods (some are insanely good). 

9. If you want to dive deeper into classifiers and machine learning algorithms, take
a look at the scikit.learn package (http://scikit-learn.org/) and try some of the
algorithms on the data in this chapter.
CHAPTER 9


Image Segmentation 


Image segmentation is the process of partitioning an image into meaningful regions.
Regions can be foreground versus background or individual objects in the image. The
regions are constructed using some feature such as color, edges, or neighbor similarity.
In this chapter we will look at some different techniques for segmentation.

9.1 Graph Cuts 

A graph is a set of nodes (sometimes called vertices) with edges between them.
See Figure 9-1 for an example. The edges can be directed (as illustrated with arrows in
Figure 9-1) or undirected, and may have weights associated with them.

A graph cut is the partitioning of a directed graph into two disjoint sets. Graph cuts 
can be used for solving many different computer vision problems like stereo depth 
reconstruction, image stitching, and image segmentation. By creating a graph from 
image pixels and their neighbors and introducing an energy or a “cost,” it is possible to 
use a graph cut process to segment an image in two or more regions. The basic idea is 
that similar pixels that are also close to each other should belong to the same partition. 

The cost of a graph cut C (where C is a set of edges) is defined as the sum of the edge
weights of the cut,

$$E_{cut} = \sum_{(i,j) \in C} w_{ij},$$

where $w_{ij}$ is the weight of the edge $(i, j)$ from node $i$ to node $j$ in the graph and the
sum is taken over all edges in the cut C.

The idea behind graph cut segmentation is to partition a graph representation of the 
image such that the cut cost is minimized. In this graph representation, two 
additional nodes, a source and a sink node, are added to the graph and only cuts that 
separate the source and sink are considered. 

Note: you also saw graphs in action in Section 2.3. This time we are going to use them to partition images.


Figure 9-1. A simple directed graph created using python-graph. 


Finding the minimum cut (or min cut) is equivalent to finding the maximum flow (or max
flow) between the source and the sink (see [2] for details). There are efficient algorithms
for solving these max flow/min cut problems.

For our graph cut examples we will use the python-graph package. This package contains
many useful graph algorithms. The website with downloads and documentation
is http://code.google.com/p/python-graph/. We will need the function maximum_flow(),
which computes the max flow/min cut using the Edmonds-Karp algorithm
(http://en.wikipedia.org/wiki/Edmonds-Karp_algorithm). The good thing about using a package
written fully in Python is ease of installation and compatibility; the downside is speed.
Performance is adequate for our purposes, but for anything but small images, a faster
implementation is needed.
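
To give a feeling for what such a max flow routine does, here is a compact sketch of the Edmonds-Karp idea in plain Python: repeatedly find the shortest path with spare capacity using breadth-first search and push as much flow as possible along it. This is our own illustrative version, not the python-graph implementation, and it only returns the value of the maximum flow. The graph is given as a dictionary from edges to capacities, which is a representation chosen for this example.

```python
from collections import deque

def edmonds_karp(capacity, source, sink):
    """Return the max flow value for a dict mapping (u, v) -> capacity."""
    if source == sink:
        raise ValueError('source and sink must differ')
    # residual capacities, with reverse edges starting at zero
    residual, neighbors = {}, {}
    for (u, v), c in capacity.items():
        residual[(u, v)] = residual.get((u, v), 0) + c
        residual.setdefault((v, u), 0)
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    total = 0
    while True:
        # breadth-first search for the shortest augmenting path
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v in neighbors.get(u, ()):
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path left
        # find the bottleneck capacity along the path
        bottleneck, v = float('inf'), sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[(parent[v], v)])
            v = parent[v]
        # push the flow and update the residual graph
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
            v = u
        total += bottleneck

capacity = {(0, 1): 4, (1, 2): 3, (2, 3): 5, (0, 2): 3, (1, 3): 4}
print(edmonds_karp(capacity, 0, 3))  # 7
```

The nested loops over paths and edges are the reason a pure Python implementation becomes slow for graphs with many thousands of pixel nodes.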

Here's a simple example of using python-graph to compute the max flow/min cut of a
small graph.

from pygraph.classes.digraph import digraph
from pygraph.algorithms.minmax import maximum_flow

gr = digraph()
gr.add_nodes([0,1,2,3])

gr.add_edge((0,1), wt=4)
gr.add_edge((1,2), wt=3)
gr.add_edge((2,3), wt=5)
gr.add_edge((0,2), wt=3)
gr.add_edge((1,3), wt=4)

flows,cuts = maximum_flow(gr,0,3)
print 'flow is:', flows
print 'cut is:', cuts

First, a directed graph is created with four nodes with index 0 ... 3. Then the edges
are added using add_edge() with an edge weight specified. This will be used as the
maximum flow capacity of the edge. (This is the same graph as the example at
http://en.wikipedia.org/wiki/Max-flow_min-cut_theorem.) The maximum flow is computed with node 0 as
source and node 3 as sink. The flow and the cuts are printed and should look like this:

flow is: {(0, 1): 4, (1, 2): 0, (1, 3): 4, (2, 3): 3, (0, 2): 3}
cut is: {0: 0, 1: 1, 2: 1, 3: 1}

These two Python dictionaries contain the flow through each edge and the label for
each node: 0 for the part of the graph containing the source, 1 for the nodes connected
to the sink. You can verify manually that the cut is indeed the minimum. The graph is
shown in Figure 9-1.
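
For a graph this small, the manual verification can also be done by brute force: try every way of placing the two inner nodes on the source side or the sink side, and add up the weights of the edges that go from the source side to the sink side. The sketch below does this in plain Python; the dictionary of edges is typed in from the example above, and cut_cost() is a helper name of our own.

```python
from itertools import combinations

edges = {(0, 1): 4, (1, 2): 3, (2, 3): 5, (0, 2): 3, (1, 3): 4}
source, sink, inner = 0, 3, [1, 2]

def cut_cost(source_side):
    """Sum the weights of edges leaving the source side."""
    return sum(w for (i, j), w in edges.items()
               if i in source_side and j not in source_side)

# enumerate every partition that keeps the source and the sink apart
costs = {}
for k in range(len(inner) + 1):
    for extra in combinations(inner, k):
        side = frozenset((source,) + extra)
        costs[side] = cut_cost(side)

best = min(costs, key=costs.get)
print(sorted(best), costs[best])
```

The cheapest partition has only node 0 on the source side, with cost 7. This matches the cut labels above and equals the total flow of 4 + 3 leaving the source, as the max flow/min cut equivalence predicts.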

Graphs from Images

Given a neighborhood structure, we can define a graph using the image pixels as nodes.
Here we will focus on the simplest case of 4-neighborhood of pixels and two image
regions (which we can call foreground and background). A 4-neighborhood is where a
pixel is connected to the pixels directly above, below, left, and right.

In addition to the pixel nodes, we will also need two special nodes: a “source” node and
a “sink” node, representing the foreground and background, respectively. We will use a
simple model where all pixels are connected to the source and the sink.

Here's how to build the graph:

• Every pixel node has an incoming edge from the source node. 

• Every pixel node has an outgoing edge to the sink node. 

• Every pixel node has one incoming and one outgoing edge to each of its neighbors. 
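
The neighbor structure is easy to express if the pixels are numbered row by row, so that pixel i in an image with m rows and n columns sits in row i // n and column i % n. The sketch below lists the 4-neighbors of a pixel under that assumption, with the source and sink taking the two indices after the last pixel. The function name neighbors4() is our own.

```python
def neighbors4(i, m, n):
    """Indices of the 4-neighbors of pixel i in an m x n image stored row by row."""
    result = []
    if i % n != 0:          # not in the first column: left exists
        result.append(i - 1)
    if (i + 1) % n != 0:    # not in the last column: right exists
        result.append(i + 1)
    if i // n != 0:         # not in the first row: up exists
        result.append(i - n)
    if i // n != m - 1:     # not in the last row: down exists
        result.append(i + n)
    return result

m, n = 3, 4
source, sink = m * n, m * n + 1   # two extra nodes after the pixel nodes

print(neighbors4(0, m, n))   # a corner pixel has two neighbors
print(neighbors4(5, m, n))   # an interior pixel has four
```

Since every pixel adds one outgoing edge to each of its neighbors, each neighboring pair ends up connected in both directions, as required by the third rule above.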

To determine the weights on these edges, you need a segmentation model that determines
the edge weights (representing the maximum flow allowed for that edge) between
pixels and between pixels and the source and sink. As before, we call the edge weight
between pixel $i$ and pixel $j$, $w_{ij}$. Let's call the weight from the source to pixel $i$, $w_{si}$,
and from pixel $i$ to the sink, $w_{it}$.

Let's look at using a naive Bayesian classifier from Section 8.2 on the color values of
the pixels. Given that we have trained a Bayes classifier on foreground and background
pixels (from the same image or from other images), we can compute the probabilities
$p_F(I_i)$ and $p_B(I_i)$ for the foreground and background. Here $I_i$ is the color vector of
pixel $i$.

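We can now create a model for the edge weights as follows:

$$w_{si} = \frac{p_F(I_i)}{p_F(I_i) + p_B(I_i)}$$

$$w_{it} = \frac{p_B(I_i)}{p_F(I_i) + p_B(I_i)}$$

$$w_{ij} = \kappa \, e^{-|I_i - I_j|^2 / \sigma}$$

Note: besides the 4-neighborhood used here, another common option is 8-neighborhood, where the
diagonal pixels are also connected.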
With this model, each pixel is connected to the foreground and background (source
and sink) with weights equal to a normalized probability of belonging to that class.
The $w_{ij}$ describe the pixel similarity between neighbors; similar pixels have weight
close to $\kappa$, dissimilar close to 0. The parameter $\sigma$ determines how fast the values decay
toward zero with increasing dissimilarity.
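
The three kinds of weights can be tried out on single pixels before building a whole graph. The sketch below is a plain Python illustration with function names of our own; the default values of kappa and sigma are arbitrary choices for the example.

```python
import math

def terminal_weights(p_fg, p_bg):
    """Normalized foreground and background weights for one pixel."""
    total = p_fg + p_bg
    return p_fg / total, p_bg / total

def neighbor_weight(color_i, color_j, kappa=2.0, sigma=100.0):
    """Similarity weight between two neighboring pixels."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(color_i, color_j))
    return kappa * math.exp(-sq_diff / sigma)

# a pixel three times more likely to be foreground than background
w_s, w_t = terminal_weights(0.03, 0.01)
print(w_s, w_t)

# identical colors give the maximum weight kappa
print(neighbor_weight((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)))
# very different colors give a weight near zero
print(neighbor_weight((0.0, 0.0, 0.0), (30.0, 30.0, 30.0)))
```

The two terminal weights always sum to one, so a pixel that is strongly tied to the source is only weakly tied to the sink, and the other way around.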

Create a file graphcut.py and add the following function that creates this graph from an 
image: 


from pygraph.classes.digraph import digraph
from pygraph.algorithms.minmax import maximum_flow

import bayes

def build_bayes_graph(im,labels,sigma=1e2,kappa=2):
    """ Build a graph from 4-neighborhood of pixels.
    Foreground and background is determined from
    labels (1 for foreground, -1 for background, 0 otherwise)
    and is modeled with naive Bayes classifiers."""

    m,n = im.shape[:2]

    # RGB vector version (one pixel per row)
    vim = im.reshape((-1,3))

    # RGB for foreground and background
    foreground = im[labels==1].reshape((-1,3))
    background = im[labels==-1].reshape((-1,3))
    train_data = [foreground,background]

    # train naive Bayes classifier
    bc = bayes.BayesClassifier()
    bc.train(train_data)

    # get probabilities for all pixels
    bc_labels,prob = bc.classify(vim)
    prob_fg = prob[0]
    prob_bg = prob[1]

    # create graph with m*n+2 nodes
    gr = digraph()
    gr.add_nodes(range(m*n+2))

    source = m*n # second to last is source
    sink = m*n+1 # last node is sink

    # normalize
    for i in range(vim.shape[0]):
        vim[i] = vim[i] / linalg.norm(vim[i])

    # go through all nodes and add edges
    for i in range(m*n):
        # add edge from source
        gr.add_edge((source,i), wt=(prob_fg[i]/(prob_fg[i]+prob_bg[i])))

        # add edge to sink
        gr.add_edge((i,sink), wt=(prob_bg[i]/(prob_fg[i]+prob_bg[i])))

        # add edges to neighbors
        if i%n != 0: # left exists
            edge_wt = kappa*exp(-1.0*sum((vim[i]-vim[i-1])**2)/sigma)
            gr.add_edge((i,i-1), wt=edge_wt)
        if (i+1)%n != 0: # right exists
            edge_wt = kappa*exp(-1.0*sum((vim[i]-vim[i+1])**2)/sigma)
            gr.add_edge((i,i+1), wt=edge_wt)
        if i//n != 0: # up exists
            edge_wt = kappa*exp(-1.0*sum((vim[i]-vim[i-n])**2)/sigma)
            gr.add_edge((i,i-n), wt=edge_wt)
        if i//n != m-1: # down exists
            edge_wt = kappa*exp(-1.0*sum((vim[i]-vim[i+n])**2)/sigma)
            gr.add_edge((i,i+n), wt=edge_wt)

    return gr

Here we used a label image with values 1 for foreground training data and -1 for
background training data. Based on this labeling, a Bayes classifier is trained on the
RGB values. Then classification probabilities are computed for each pixel. These are
then used as edge weights for the edges going from the source and to the sink. A graph
with n*m + 2 nodes is created. Note the index of the source and sink; we choose them
as the last two to simplify the indexing of the pixels.

To visualize the labeling overlaid on the image we can use the function contourf(),
which fills the regions between contour levels of an image (in this case the label image).
The alpha variable sets the transparency. Add the following function to graphcut.py.

def show_labeling(im,labels):
    """ Show image with foreground and background areas.
    labels = 1 for foreground, -1 for background, 0 otherwise."""

    imshow(im)
    contour(labels,[-0.5,0.5])
    contourf(labels,[-1,-0.5],colors='b',alpha=0.25)
    contourf(labels,[0.5,1],colors='r',alpha=0.25)
    axis('off')

Once the graph is built, it needs to be cut at the optimal location. The following function
computes the min cut and reformats the output to a binary image of pixel labels:

def cut_graph(gr,imsize):
    """ Solve max flow of graph gr and return binary
    labels of the resulting segmentation."""

    m,n = imsize
    source = m*n # second to last is source
    sink = m*n+1 # last is sink

    # cut the graph
    flows,cuts = maximum_flow(gr,source,sink)

    # convert graph to image with labels
    res = zeros(m*n)
    for pos,label in cuts.items()[:-2]: #don't add source/sink
        res[pos] = label

    return res.reshape((m,n))

Again, note the indices for the source and sink. We need to take the size of the image
as input to compute these indices and to reshape the output before returning the
segmentation. The cut is returned as a dictionary, which needs to be copied to an image
of segmentation labels. This is done using the .items() method that returns a list of (key,
value) pairs. Again we skip the last two elements of that list.
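
Skipping the last two elements relies on the order in which the dictionary returns its items, and in Python 3 the result of .items() is a view that cannot be sliced at all. A more defensive variant, sketched below in plain Python with nested lists instead of arrays, skips the source and sink by their keys. The function name cuts_to_labels() is our own.

```python
def cuts_to_labels(cuts, m, n):
    """Copy node labels from the cut dictionary into an m x n nested list."""
    source, sink = m * n, m * n + 1
    flat = [0] * (m * n)
    for node, label in cuts.items():
        if node in (source, sink):
            continue  # the two extra nodes are not pixels
        flat[node] = label
    # split the flat list into rows of length n
    return [flat[r * n:(r + 1) * n] for r in range(m)]

# a 2 x 2 image: nodes 0..3 are pixels, 4 is the source, 5 is the sink
cuts = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1}
print(cuts_to_labels(cuts, 2, 2))
```

For this tiny example the left column ends up with label 0 and the right column with label 1.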

Let's see how to use these functions for segmenting an image. The following is a
complete example of reading an image and creating a graph with class probabilities
estimated from two rectangular image regions:

from scipy.misc import imresize
import graphcut

im = array(Image.open('empire.jpg'))
im = imresize(im,0.07,interp='bilinear')
size = im.shape[:2]

# add two rectangular training regions
labels = zeros(size)
labels[3:18,3:18] = -1
labels[-18:-3,-18:-3] = 1

# create graph
g = graphcut.build_bayes_graph(im,labels,kappa=1)

# cut the graph
res = graphcut.cut_graph(g,size)

figure()
graphcut.show_labeling(im,labels)

figure()
imshow(res)
gray()
axis('off')

show()

We use the imresize() function to make the image small enough for our Python graph
library, in this case uniform scaling to 7% of the original size. The graph is cut and the
result plotted together with an image showing the training regions. Figure 9-2 shows
the training regions overlaid on the image and the final segmentation result.

The variable kappa ($\kappa$ in the equations) determines the relative weight of the edges
between neighboring pixels. The effect of changing kappa can be seen in Figure 9-3.
With increasing value, the segmentation boundary will be smoother and details will
be lost. Choosing the right value is up to you. The right value will depend on your
application and the type of result you desire.
Figure 9-2. An example of graph cut segmentation using a Bayesian probability model. Image is
downsampled to size 54 x 38. Label image for model training (left); training regions shown on the
image (middle); segmentation (right).



(a) (b) (c) (d) 


Figure 9-3. The effect of changing the relative weighting between pixel similarity and class probability. 
The same segmentation as in Figure 9-2 with: (a) k — l, (b) k — 1, (c) k = 5, and (d) k = 10. 


Segmentation with User Input 

Graph cut segmentation can be combined with user input in a number of ways. For 
example, a user can supply markers for foreground and background by drawing on an 
image. Another way is to select a region that contains the foreground with a bounding 
box or using a “lasso” tool. 

Let's look at this last example using some images from the Grab Cut dataset from
Microsoft Research Cambridge; see [27] and Appendix B.5 for details.

These images come with ground truth labels for measuring segmentation performance.
They also come with annotations simulating a user selecting a rectangular image region
or drawing on the image with a “lasso” type tool to mark foreground and background.
We can use these user inputs to get training data and apply graph cuts to segment the
image guided by the user input.

The user input is encoded in bitmap images with the following meaning:

Pixel value    Meaning
0, 64          background
128            unknown
255            foreground

Here's a complete code example of loading an image and annotations and passing that
to our graph cut segmentation routine:

from PIL import Image
from pylab import *
from scipy.misc import imresize
import graphcut

def create_msr_labels(m,lasso=False):
    """ Create label matrix for training from
        user annotations. """

    # same size as the annotation map
    labels = zeros(m.shape[:2])

    # background
    labels[m==0] = -1
    labels[m==64] = -1

    # foreground
    if lasso:
        labels[m==255] = 1
    else:
        labels[m==128] = 1

    return labels

# load image and annotation map
im = array(Image.open('376043.jpg'))
m = array(Image.open('376043.bmp'))

# resize
scale = 0.1
im = imresize(im,scale,interp='bilinear')
m = imresize(m,scale,interp='nearest')

# create training labels
labels = create_msr_labels(m,False)

# build graph using annotations
g = graphcut.build_bayes_graph(im,labels,kappa=2)

# cut graph
res = graphcut.cut_graph(g,im.shape[:2])

# remove parts in background
res[m==0] = 1
res[m==64] = 1

# plot the result
figure()
imshow(res)
gray()
xticks([])
yticks([])
savefig('labelplot.pdf')

First, we define a helper function to read the annotation images and format them so
we can pass them to our function for training background and foreground models. The
bounding rectangles contain only background labels. In this case, we set the foreground
training region to the whole “unknown” region (the inside of the rectangle). Next, we
build the graph and cut it. Since we have user input, we remove results that have any
foreground in the marked background area. Last, we plot the resulting segmentation
and remove the tick markers by setting them to an empty list. That way we get a nice
bounding box (otherwise the boundaries of the image will be hard to see in this
black-and-white plot).

Figure 9-4 shows some results using RGB vector as feature with the original image, a 
downsampled mask, and downsampled resulting segmentation. The image on the right 
is the plot generated by the script above. 



Figure 9-4. Sample graph cut segmentation results using images from the Grab Cut data set: original
image, downsampled (left); mask used for training (middle); resulting segmentation using RGB values
as feature vectors (right).

9.2 Segmentation Using Clustering


The graph cut formulation in the previous section solves the segmentation problem by 
finding a discrete solution using max flow/min cut over an image graph. In this section 
we will look at an alternative way to cut the image graph. The normalized cut algorithm, 
based on spectral graph theory, combines pixel similarities with spatial proximity to 
segment the image. 

The idea comes from defining a cut cost that takes into account the size of the groups
and “normalizes” the cost with the size of the partitions. The normalized cut formulation
modifies the cut cost of equation (9.1) to

$$E_{ncut} = \frac{E_{cut}}{\sum_{i \in A} w_{ix}} + \frac{E_{cut}}{\sum_{j \in B} w_{jx}},$$


where A and B indicate the two sets of the cut and the sums add the weights from A 
and B, respectively, to all other nodes in the graph (which are pixels in the image in 
this case). This sum is called the association and for images where pixels have the same 
number of connections to other pixels, it is a rough measure of the size of the partitions. 
In the paper [32], the cost function above was introduced together with an algorithm 
for finding a minimizer. The algorithm is derived for two-class segmentation and will 
be described next. 
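
To make the cost concrete, here is a minimal sketch that evaluates the normalized cut cost for a given two-way split of a tiny weighted graph. The function name ncut_cost, the toy weight matrix, and the use of `import numpy as np` (rather than the star imports used elsewhere in this book) are assumptions made for illustration only:

```python
import numpy as np

def ncut_cost(W, in_A):
    """ Normalized cut cost for splitting the nodes into A (in_A True) and B. """
    in_A = np.asarray(in_A, dtype=bool)
    in_B = ~in_A
    # cut cost: total weight of the edges that go between A and B
    cut = W[np.ix_(in_A, in_B)].sum()
    # association: weights from each set to all nodes in the graph
    assoc_A = W[in_A, :].sum()
    assoc_B = W[in_B, :].sum()
    return cut / assoc_A + cut / assoc_B

# toy graph: two tightly connected pairs (0,1) and (2,3) joined by weak edges
W = np.array([[0.0, 1.0, 0.1, 0.0],
              [1.0, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 1.0],
              [0.0, 0.1, 1.0, 0.0]])

balanced = ncut_cost(W, [True, True, False, False])
lopsided = ncut_cost(W, [True, False, False, False])
```

On this toy graph the split that separates the two pairs gets a much lower cost than the split that isolates a single node, which is the behavior the normalization is meant to encourage.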


Define W as the edge weight matrix with elements $w_{ij}$ containing the weight of the
edge connecting pixel i with pixel j. Let D be the diagonal matrix of the row sums of
W, $D = \mathrm{diag}(d_i)$, $d_i = \sum_j w_{ij}$ (same as in Section 6.3). The normalized cut segmentation
is obtained as the minimum of the optimization problem

$$\min_{\mathbf{y}} \frac{\mathbf{y}^T (D - W) \mathbf{y}}{\mathbf{y}^T D \mathbf{y}},$$

where the vector y contains the discrete labels that satisfy the constraints $y_i \in \{1, -b\}$
for some constant b (meaning that y only takes two discrete values) and the elements of
$\mathbf{y}^T D$ sum to zero. Because of these constraints, this is not easily solvable.*

However, by relaxing the constraints and letting y take any real value, the problem 
becomes an eigenvalue problem that is easily solved. The drawback is that you need to 
threshold or cluster the output to make it a discrete segmentation again. 

Relaxing the problem results in solving for eigenvectors of a Laplacian matrix

$$L = D^{-1/2} W D^{-1/2},$$

just like the spectral clustering case. The only remaining difficulty is now to define the
between-pixel edge weights $w_{ij}$. Normalized cuts have many similarities to spectral
clustering and the underlying theory overlaps somewhat. See [32] for an explanation
and the details.
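
As a rough illustration of the relaxed problem, the sketch below builds the matrix $D^{-1/2} W D^{-1/2}$ (the same matrix that the cluster() function later in this section constructs) for a four-node toy graph and thresholds one eigenvector to get two groups. The toy weights and the choice of thresholding at zero are assumptions for the example, not part of the method in [32]:

```python
import numpy as np

# toy graph: two tightly connected pairs (0,1) and (2,3) joined by weak edges
W = np.array([[0.0, 1.0, 0.1, 0.0],
              [1.0, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 1.0],
              [0.0, 0.1, 1.0, 0.0]])

# D^(-1/2) W D^(-1/2) with D the diagonal matrix of row sums
d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.dot(D_inv_sqrt, np.dot(W, D_inv_sqrt))

# eigh() returns the eigenvalues of a symmetric matrix in ascending order
vals, vecs = np.linalg.eigh(L)

# the largest eigenvalue belongs to the trivial vector sqrt(d);
# the next one carries the two-way partition
second = vecs[:, -2]
labels = second > 0
```

The sign of an eigenvector is arbitrary, so which group ends up labeled True can differ between runs or libraries; only the grouping itself is meaningful.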


* In fact, this problem is NP-hard.

Let's use the edge weights from the original normalized cuts paper [32]. The edge weight
connecting two pixels i and j is given by

$$w_{ij} = e^{-|I_i - I_j|^2/\sigma_g} \, e^{-|\mathbf{x}_i - \mathbf{x}_j|^2/\sigma_d}.$$

The first part measures the pixel similarity between the pixels with $I_i$ and $I_j$ denoting
either the RGB vectors or the grayscale values. The second part measures the proximity
between the pixels in the image with $\mathbf{x}_i$ and $\mathbf{x}_j$ denoting the coordinate vector of each
pixel. The scaling factors $\sigma_g$ and $\sigma_d$ determine the relative scales and how fast each
component approaches zero.

Let's see what this looks like in code. Add the following function to a file ncut.py:

from numpy import *

def ncut_graph_matrix(im,sigma_d=1e2,sigma_g=1e-2):
    """ Create matrix for normalized cut. The parameters are
        the weights for pixel distance and pixel similarity. """

    m,n = im.shape[:2]
    N = m*n

    # normalize and create feature vector of RGB or grayscale
    if len(im.shape)==3:
        for i in range(3):
            im[:,:,i] = im[:,:,i] / im[:,:,i].max()
        vim = im.reshape((-1,3))
    else:
        im = im / im.max()
        vim = im.flatten()

    # x,y coordinates for distance computation
    xx,yy = meshgrid(range(n),range(m))
    x,y = xx.flatten(),yy.flatten()

    # create matrix with edge weights
    W = zeros((N,N),'f')
    for i in range(N):
        for j in range(i,N):
            d = (x[i]-x[j])**2 + (y[i]-y[j])**2
            W[i,j] = W[j,i] = exp(-1.0*sum((vim[i]-vim[j])**2)/sigma_g) * exp(-d/sigma_d)

    return W

This function takes an image array and creates a feature vector using either RGB values
or grayscale values depending on the input image. Since the edge weights contain a
distance component, we use meshgrid() to get the x and y values for each pixel feature
vector. Then the function loops over all N pixels and fills out the values in the N x N
normalized cut matrix W.
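
The double loop over all pixel pairs is slow in pure Python. As a possible alternative, the same weights can be computed with NumPy broadcasting. This is only a sketch: the function name ncut_graph_matrix_vectorized is hypothetical, it uses `import numpy as np` instead of the star import above, and it trades speed for memory because it holds all N x N pairwise differences at once:

```python
import numpy as np

def ncut_graph_matrix_vectorized(im, sigma_d=1e2, sigma_g=1e-2):
    """ Edge weight matrix for normalized cut, computed without Python loops. """
    im = np.asarray(im, dtype=float)
    m, n = im.shape[:2]

    # normalize: per channel for color images, globally for grayscale
    if im.ndim == 3:
        vim = (im / im.max(axis=(0, 1))).reshape(-1, im.shape[2])
    else:
        vim = (im / im.max()).reshape(-1, 1)

    # pixel coordinates in the same row-major order as vim
    xx, yy = np.meshgrid(np.arange(n), np.arange(m))
    pos = np.column_stack([xx.ravel(), yy.ravel()]).astype(float)

    # squared differences between all pairs of pixels via broadcasting
    feat_d2 = ((vim[:, None, :] - vim[None, :, :]) ** 2).sum(axis=2)
    pos_d2 = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(axis=2)

    return np.exp(-feat_d2 / sigma_g) * np.exp(-pos_d2 / sigma_d)

# small grayscale test image
im = np.array([[1.0, 1.0, 0.2],
               [1.0, 0.2, 0.2]])
W = ncut_graph_matrix_vectorized(im)
```

For the 50 x 50 images used in this section the intermediate arrays stay small; for larger images a sparse matrix restricted to nearby pixels would be the more natural choice.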

We can compute the segmentation either by sequentially cutting each eigenvector or by
taking a number of eigenvectors and applying clustering. We chose the second approach,
which also works without modification for any number of segments. We take the top
ndim eigenvectors of the Laplacian matrix corresponding to W and cluster the pixels.
The following function implements the clustering. As you can see, it is almost the same
as the spectral clustering example in Section 6.3:


from scipy.cluster.vq import *

def cluster(S,k,ndim):
    """ Spectral clustering from a similarity matrix."""

    # check for symmetry
    if sum(abs(S-S.T)) > 1e-10:
        print 'not symmetric'

    # create Laplacian matrix
    rowsum = sum(abs(S),axis=0)
    D = diag(1 / sqrt(rowsum + 1e-6))
    L = dot(D,dot(S,D))

    # compute eigenvectors of L
    U,sigma,V = linalg.svd(L)

    # create feature vector from ndim first eigenvectors
    # by stacking eigenvectors as columns
    features = array(V[:ndim]).T

    # k-means
    features = whiten(features)
    centroids,distortion = kmeans(features,k)
    code,distance = vq(features,centroids)

    return code,V

Here we used the k-means clustering algorithm (see Section 6.1 for details) to group
the pixels based on the values in the eigenvector images. You could try any clustering
algorithm or grouping criteria if you feel like experimenting with the results.

Now we are ready to try this on some sample images. The following script shows a 
complete example: 

from PIL import Image
from pylab import *
from scipy.misc import imresize
import ncut

im = array(Image.open('C-uniform03.ppm'))
m,n = im.shape[:2]

# resize image to (wid,wid)
wid = 50
rim = imresize(im,(wid,wid),interp='bilinear')
rim = array(rim,'f')

# create normalized cut matrix
A = ncut.ncut_graph_matrix(rim,sigma_d=1,sigma_g=1e-2)

# cluster
code,V = ncut.cluster(A,k=3,ndim=3)

# reshape to original image size
codeim = imresize(code.reshape(wid,wid),(m,n),interp='nearest')

# plot result
figure()
imshow(codeim)
gray()
show()

Here we resize the image to a fixed size (50 x 50 in this example) in order to make
the eigenvector computation fast enough. The NumPy linalg.svd() function is not fast
enough to handle large matrices (and sometimes gives inaccurate results for too large
matrices). We use bilinear interpolation when resizing the image, but nearest neighbor
interpolation when resizing the resulting segmentation label image, since we don't want
to interpolate the class labels. Note the use of first reshaping the one-dimensional array
to (wid,wid) followed by resizing to the original image size.
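
The reason for nearest neighbor interpolation is easy to see with plain NumPy: every output pixel copies exactly one input label, so no new in-between label values can appear. The helper resize_nearest below is a hypothetical stand-in written for this illustration (imresize() has been removed from recent SciPy releases, so a small substitute like this may be needed anyway); it is a sketch, not the routine used to produce the figures:

```python
import numpy as np

def resize_nearest(labels, new_shape):
    """ Nearest neighbor resize of a 2D label array to new_shape (rows, cols). """
    rows, cols = labels.shape
    new_rows, new_cols = new_shape
    # for each output row and column, pick the source index it falls into
    r = np.arange(new_rows) * rows // new_rows
    c = np.arange(new_cols) * cols // new_cols
    return labels[np.ix_(r, c)]

# one label per pixel, flattened as returned by the clustering
code = np.array([0, 2, 1, 1])
small = code.reshape(2, 2)
big = resize_nearest(small, (4, 6))
```

The enlarged array contains only the original labels 0, 1, and 2, whereas bilinear interpolation between labels 0 and 2 would create a meaningless intermediate class.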

In the example, we used one of the hand gesture images from the Static Hand Posture 
Database (see Section 8.1 for more details) with k = 3. The resulting segmentation is 
shown in Figure 9-5 together with the first four eigenvectors. 

The eigenvectors are returned as the array V in the example and can be visualized as 
images like this: 

imshow(imresize(V[i].reshape(wid,wid),(m,n),interp='bilinear')) 

This will show eigenvector i as an image at the original image size. 

Figure 9-6 shows some more examples using the same script above. The airplane image
is from the “airplane” category in the Caltech 101 dataset. For these examples, we kept
the parameters $\sigma_g$ and $\sigma_d$ to the same values as above. Changing them can give you
smoother, more regularized results and quite different eigenvector images. We leave the
experimentation to you.

Figure 9-5. Image segmentation using the normalized cuts algorithm: the original image and the
resulting three-class segmentation (top); the first four eigenvectors of the graph similarity matrix
(bottom).

Figure 9-6. Examples of two-class image segmentation using the normalized cuts algorithm: original
image (left); segmentation result (right).

It is worth noting that even for these fairly simple examples, a thresholding of the image
would not have given the same result; neither would clustering the RGB or graylevel
values. This is because neither of these take the pixel neighborhoods into account.


9.3 Variational Methods 

In this book, you have seen a number of examples of minimizing a cost or energy to
solve computer vision problems. In the previous sections it was minimizing the cut in a
graph, but we also saw examples like the ROF de-noising, k-means, and support vector
machines. These are examples of optimization problems.

When the optimization is taken over functions, the problems are called variational
problems, and algorithms for solving such problems are called variational methods. Let's
look at a simple and effective variational model.

Figure 9-7. The piecewise constant Chan-Vese segmentation model.

The Chan-Vese segmentation model [6] assumes a piecewise constant image model for 
the image regions to be segmented. Here we will focus on the case of two regions, for 
example foreground and background, but the model extends to multiple regions as 
well; see, for example, [38]. The model can be described as follows. 

If we let a collection of curves $\Gamma$ separate the image into two regions $\Omega_1$ and $\Omega_2$ as in
Figure 9-7, the segmentation is given by minima of the Chan-Vese model energy

$$E(\Gamma) = \lambda \, \mathrm{length}(\Gamma) + \int_{\Omega_1} (I - c_1)^2 \, d\mathbf{x} + \int_{\Omega_2} (I - c_2)^2 \, d\mathbf{x},$$

which measures the deviation from the constant graylevels in each region, $c_1$ and $c_2$.
Here the integrals are taken over each region and the length of the separating curves is
there to prefer smoother solutions.

With a piecewise constant image $U = \chi_1 c_1 + \chi_2 c_2$, this can be re-written as

$$E(\Gamma) = \lambda \frac{|c_1 - c_2|}{2} \int |\nabla U| \, d\mathbf{x} + \|I - U\|^2,$$

where $\chi_1$ and $\chi_2$ are characteristic (indicator) functions for the two regions.* This
transformation is non-trivial and requires some heavy mathematics that are not needed
for understanding and that are well outside the scope of this book.

The point is that this equation is now the same as the ROF equation (1.1) with $\lambda$ replaced
by $\lambda |c_1 - c_2|$. The only difference is that in the Chan-Vese case we are looking for an
image U that is piecewise constant. It can be shown that thresholding the ROF solution
will give a good minimizer. The interested reader can check [8] for the details.
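
To compare candidate segmentations, it can help to evaluate a discrete version of the Chan-Vese energy directly. The sketch below is one possible approximation under stated assumptions: the function name chan_vese_energy and the parameter lam are made up for the example, the region constants are taken to be the region means, and the curve length is crudely approximated by counting neighboring pixel pairs with different labels:

```python
import numpy as np

def chan_vese_energy(im, mask, lam=1.0):
    """ Discrete piecewise constant energy for a binary segmentation mask. """
    im = np.asarray(im, dtype=float)
    mask = np.asarray(mask, dtype=bool)

    # for a fixed partition, the best constants are the region means
    c1 = im[mask].mean() if mask.any() else 0.0
    c2 = im[~mask].mean() if (~mask).any() else 0.0
    data = ((im[mask] - c1) ** 2).sum() + ((im[~mask] - c2) ** 2).sum()

    # crude curve length: number of 4-neighbor pairs with different labels
    length = ((mask[1:, :] != mask[:-1, :]).sum() +
              (mask[:, 1:] != mask[:, :-1]).sum())

    return lam * length + data

# synthetic image: dark left half, bright right half
im = np.zeros((4, 4))
im[:, 2:] = 1.0

good_mask = im > 0.5
bad_mask = np.zeros((4, 4), dtype=bool)
bad_mask[0, 0] = True

good = chan_vese_energy(im, good_mask)
bad = chan_vese_energy(im, bad_mask)
```

On this synthetic image the split along the true boundary has zero data term and a lower total energy than a mask that singles out one corner pixel.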


* Characteristic functions are 1 in the region and 0 outside.

Figure 9-8. Examples of image segmentation by minimizing the Chan-Vese model using ROF de-noising:
(a) original image; (b) image after ROF de-noising; (c) final segmentation.


Minimizing the Chan-Vese model now becomes a ROF de-noising followed by 
thresholding: 

from PIL import Image
from numpy import *
import scipy.misc
import rof

im = array(Image.open('ceramic-houses_t0.png').convert("L"))
U,T = rof.denoise(im,im,tolerance=0.001)
t = 0.4 # threshold

scipy.misc.imsave('result.pdf',U < t*U.max())

In this case, we turn down the tolerance threshold for stopping the ROF iterations to
make sure we get enough iterations. Figure 9-8 shows the result on two rather difficult
images.

Exercises 

1. It is possible to speed up computation for the graph cut optimization by reducing 
the number of edges. This graph construction is described in Section 4.2 of [16]. 
Try this out and measure the difference in graph size and in segmentation time 
compared to the simpler construction we used. 

2. Create a user interface or simulate a user selecting regions for graph cut segmentation.
Then try “hard coding” background and foreground by setting weights to
some large value.

3. Change the feature vector in the graph cut segmentation from a RGB vector to some 
other descriptor. Can you improve on the segmentation results? 

4. Implement an iterative segmentation approach using graph cut where a current 
segmentation is used to train new foreground and background models for the next. 
Does it improve segmentation quality? 

5. The Microsoft Research Grab Cut dataset contains ground truth segmentation 
maps. Implement a function that measures the segmentation error and evaluate 
different settings and some of the ideas in the exercises above. 

6. Try to vary the parameters of the normalized cuts edge weight and see how they
affect the eigenvector images and the segmentation result.

7. Compute image gradients on the first normalized cuts eigenvectors. Combine these 
gradient images to detect image contours of objects. 

8. Implement a linear search over the threshold value for the de-noised image in
Chan-Vese segmentation. For each threshold, store the energy $E(\Gamma)$ and pick the
segmentation with the lowest value.

CHAPTER 10


OpenCV


This chapter gives a brief overview of how to use the popular computer vision library
OpenCV through the Python interface. OpenCV is a C++ library for real-time
computer vision initially developed by Intel and now maintained by Willow Garage.
OpenCV is open source and released under a BSD license, meaning it is free for both
academic and commercial use. As of version 2.0, Python support has been greatly improved.
We will go through some basic examples and look deeper into tracking and
video.


10.1 The OpenCV Python Interface 

OpenCV is a C++ library with modules that cover many areas of computer vision.
Besides C++ (and C), there is growing support for Python as a simpler scripting
language through a Python interface on top of the C++ code base. The Python interface
is still under development; not all parts of OpenCV are exposed and many functions
are undocumented. This is likely to change, as there is an active community behind
this interface. The Python interface is documented at http://opencv.willowgarage.com/
documentation/python/index.html. See Appendix A for installation instructions.

The current OpenCV version (2.3.1) actually comes with two Python interfaces. The
old cv module uses internal OpenCV datatypes and can be a little tricky to use from
NumPy. The new cv2 module uses NumPy arrays and is much more intuitive to use.* The
module is available as

import cv2

and the old module can be accessed as

import cv2.cv

* The names and location of these two modules are likely to change over time. Check the online documentation
for changes.

We will focus on the cv2 module in this chapter. Look out for future name changes, as
well as changes in function names and definitions in future versions. OpenCV and the
Python interface are under rapid development.

10.2 OpenCV Basics

OpenCV comes with functions for reading and writing images, as well as matrix
operations and math libraries. For the details on OpenCV, there is an excellent book
[3] (C++ only). Let's look at some of the basic components and how to use them.

Reading and Writing Images 

This short example will load an image, print the size, and convert and save the image 
in .png format: 

import cv2

# read image
im = cv2.imread('empire.jpg')
h,w = im.shape[:2]
print h,w

# save image
cv2.imwrite('result.png',im)

The function imread() returns the image as a standard NumPy array and can handle a
wide range of image formats. You can use this function as an alternative to the PIL image
reading if you like. The function imwrite() automatically takes care of any conversion
based on the file ending.

Color Spaces 

In OpenCV images are not stored using the conventional RGB color channels; they are 
stored in BGR order (the reverse order). When reading an image the default is BGR; 
however, there are several conversions available. Color space conversions are done using 
the function cvtColor(). For example, converting to grayscale is done like this: 

im = cv2.imread('empire.jpg')

# create a grayscale version
gray = cv2.cvtColor(im,cv2.COLOR_BGR2GRAY)

After the source image, there is an OpenCV color conversion code. Some of the most 
useful conversion codes are: 

• cv2.COLOR_BGR2GRAY 

• cv2.COLOR_BGR2RGB 

• cv2.COLOR_GRAY2BGR 

Figure 10-1. Example of computing an integral image using OpenCV's integral() function.

In each of these, the number of color channels for resulting images will match the
conversion code (single channel for gray and three channels for RGB and BGR). The last
version converts grayscale images to BGR and is useful if you want to plot or overlay
colored objects on the images. We will use this in the examples.
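
Because the images are ordinary NumPy arrays, the channel order can also be reversed with slicing. The following small sketch uses a synthetic array instead of a file so it runs without any image on disk; for three-channel images the slice should give the same channel order as cv2.COLOR_BGR2RGB:

```python
import numpy as np

# a 2 x 2 image that is pure blue when interpreted in BGR order
bgr = np.zeros((2, 2, 3), dtype=np.uint8)
bgr[..., 0] = 255

# reverse the last axis to get RGB order
rgb = bgr[:, :, ::-1]
```

Note that the slice is a view on the same data, not a copy. Call rgb.copy() if a library needs a contiguous array of its own.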

Displaying Images and Results 

Let's look at some examples of using OpenCV for image processing and how to show
results with OpenCV plotting and window management.

The first example reads an image from file and creates an integral image representation:

import cv2

# read image
im = cv2.imread('fisherman.jpg')
gray = cv2.cvtColor(im,cv2.COLOR_BGR2GRAY)

# compute integral image
intim = cv2.integral(gray)

# normalize and save
intim = (255.0*intim) / intim.max()
cv2.imwrite('result.jpg',intim)

After reading the image and converting to grayscale, the function integral() creates an
image where the value at each pixel is the sum of the intensities above and to the left.
This is a very useful trick for quickly evaluating features. Integral images are used in
OpenCV's CascadeClassifier, which is based on a framework introduced by Viola and
Jones [39]. Before saving the resulting image, we normalize the values to 0 . . . 255 by
dividing with the largest value. Figure 10-1 shows the result for an example image.
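
To see why integral images make feature evaluation fast, here is a NumPy-only sketch of the idea. The function names integral_image and box_sum are made up for this example. The table has one extra row and column of zeros, which matches the size of the array returned by cv2.integral(), and any rectangular sum then needs only four lookups:

```python
import numpy as np

def integral_image(gray):
    """ Summed-area table with an extra leading row and column of zeros. """
    ii = np.zeros((gray.shape[0] + 1, gray.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = gray.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, r0, c0, r1, c1):
    """ Sum of gray[r0:r1, c0:c1] using four table lookups. """
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

gray = np.arange(12, dtype=np.uint8).reshape(3, 4)
ii = integral_image(gray)

# sum over the 2 x 2 block in the lower middle of the image
s = box_sum(ii, 1, 1, 3, 3)
```

The cost of box_sum() is independent of the size of the rectangle, which is what makes box-filter style features cheap to evaluate at many scales and positions.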

The second example applies flood filling starting from a seed pixel:

import cv2
from numpy import *

# read image
filename = 'fisherman.jpg'
im = cv2.imread(filename)
h,w = im.shape[:2]

# flood fill example
diff = (6,6,6)
mask = zeros((h+2,w+2),uint8)
cv2.floodFill(im,mask,(10,10),(255,255,0),diff,diff)

# show the result in an OpenCV window
cv2.imshow('flood fill',im)
cv2.waitKey()

# save the result
cv2.imwrite('result.jpg',im)

Figure 10-2. Flood fill of a color image. The highlighted area in the right panel marks all pixels filled
using a single seed in the upper-left corner.

This example applies flood fill to the image and shows the result in an OpenCV window.
The function waitKey() pauses until a key is pressed and the window is automatically
closed. Here the function floodFill() takes the image (grayscale or color), a mask with
non-zero pixels indicating areas not to be filled, a seed pixel, and the new color value
to replace the flooded pixels together with lower and upper difference thresholds to
accept new pixels. The flood fill starts at the seed pixel and keeps expanding as long
as new pixels can be added within the difference thresholds. The difference thresholds
are given as tuples (R,G,B). The result looks like Figure 10-2.
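
The expansion rule can be illustrated with a small pure-Python sketch. This is a simplified grayscale version written for this explanation, not OpenCV's implementation: the function name flood_fill is hypothetical, it uses 4-connectivity, it compares each candidate pixel with the already accepted neighbor it is reached from, and the seed is given as (row, column) rather than the (x, y) order used by floodFill():

```python
from collections import deque
import numpy as np

def flood_fill(gray, seed, new_value, lo_diff, up_diff):
    """ Fill outward from seed while neighboring pixels differ by at most
        lo_diff downward or up_diff upward. """
    h, w = gray.shape
    src = gray.astype(int)          # signed copy for safe differences
    out = gray.copy()
    filled = np.zeros((h, w), dtype=bool)
    filled[seed] = True
    queue = deque([seed])

    while queue:
        r, c = queue.popleft()
        out[r, c] = new_value
        for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= rr < h and 0 <= cc < w and not filled[rr, cc]:
                diff = src[rr, cc] - src[r, c]
                if -lo_diff <= diff <= up_diff:
                    filled[rr, cc] = True
                    queue.append((rr, cc))
    return out, filled

gray = np.array([[10, 12, 50],
                 [11, 13, 52],
                 [90, 14, 51]], dtype=np.uint8)
out, filled = flood_fill(gray, (0, 0), 255, 6, 6)
```

In this tiny image the fill spreads through the five pixels with values 10 to 14 and stops at the bright pixels, whose values jump by more than the threshold of 6.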

As a third and final example, we look at extracting SURF features, a faster version of
SIFT introduced by [1]. Here we also show how to use some basic OpenCV plotting
commands:

import cv2
from numpy import *

# read image
im = cv2.imread('empire.jpg')

# downsample
im_lowres = cv2.pyrDown(im)

# convert to grayscale (imread() returns BGR order)
gray = cv2.cvtColor(im_lowres,cv2.COLOR_BGR2GRAY)

# detect feature points
s = cv2.SURF()
mask = uint8(ones(gray.shape))
keypoints = s.detect(gray,mask)

# show image and points
vis = cv2.cvtColor(gray,cv2.COLOR_GRAY2BGR)

for k in keypoints[::10]:
    cv2.circle(vis,(int(k.pt[0]),int(k.pt[1])),2,(0,255,0),-1)
    cv2.circle(vis,(int(k.pt[0]),int(k.pt[1])),int(k.size),(0,255,0),2)

cv2.imshow('local descriptors',vis)
cv2.waitKey()

Figure 10-3. Sample SURF features extracted and plotted using OpenCV.

After reading the image, it is downsampled using the function pyrDown(), which, if no
new size is given, creates a new image half the size of the original. Then the image
is converted to grayscale and passed to a SURF keypoint detection object. The mask
determines what areas to apply the keypoint detector. To plot, we convert the grayscale
image to a color image and use the green channel for plotting the keypoints. We loop
over every tenth keypoint and plot a circle at the center and one circle showing the
scale (size) of the keypoint. The plotting function circle() takes an image, a tuple with
image coordinates (integer only), a radius, a tuple with plot color, and finally the line
thickness (-1 gives a solid circle). Figure 10-3 shows the result.


10.3 Processing Video 

Video with pure Python is hard. There are speed, codecs, cameras, operating systems,
and file formats to consider. There is currently no video library for Python. OpenCV
with its Python interface is the only good option. In this section, we’ll look at some
basic examples using video.


Video Input 

Reading video from a camera is very well supported in OpenCV. A basic complete
example that captures frames and shows them in an OpenCV window looks like this:

import cv2

# setup video capture
cap = cv2.VideoCapture(0)

while True:
    ret,im = cap.read()
    cv2.imshow('video test',im)
    key = cv2.waitKey(10)
    if key == 27:
        break
    if key == ord(' '):
        cv2.imwrite('vid_result.jpg',im)

The capture object VideoCapture captures video from cameras or files. Here we pass an
integer at initialization. This is the id of the video device; with a single camera connected
this is 0. The method read() decodes and returns the next video frame. The first value
is a success flag and the second the actual image array. The waitKey() function waits
for a key to be pressed and quits the application if the ‘Esc’ key (ASCII number 27) is
pressed, or saves the frame if the ‘space’ key is pressed.

Let’s extend this example with some simple processing by taking the camera input and
showing a blurred (color) version of the input in an OpenCV window. This is only a
slight modification to the base example above:

import cv2

# setup video capture
cap = cv2.VideoCapture(0)

# get frame, apply Gaussian smoothing, show result
while True:
    ret,im = cap.read()
    blur = cv2.GaussianBlur(im,(0,0),5)
    cv2.imshow('camera blur',blur)
    if cv2.waitKey(10) == 27:
        break

Each frame is passed to the function GaussianBlur(), which applies a Gaussian filter to
the image. In this case, we are passing a color image so each color channel is blurred
separately. The function takes a tuple for filter size and the standard deviation for the
Gaussian function (in this case 5). If the filter size is set to zero, it will automatically be
determined from the standard deviation. The result looks like Figure 10-4.

Figure 10-4. Screenshot of a blurred video of the author as he’s writing this chapter.


Reading video from files works the same way but with the call to VideoCapture() taking 
the video filename as input: 

capture = cv2.VideoCapture('filename') 

Reading Video to NumPy Arrays 

Using OpenCV, it is possible to read video frames from a file and convert them to NumPy 
arrays. Here is an example of capturing video from a camera and storing the frames in 
a NumPy array: 

import cv2
from numpy import *

# setup video capture
cap = cv2.VideoCapture(0)

frames = []
# get frame, store in array
while True:
    ret,im = cap.read()
    cv2.imshow('video',im)
    frames.append(im)
    if cv2.waitKey(10) == 27:
        break
frames = array(frames)

# check the sizes
print im.shape
print frames.shape

Each frame array is added to the end of a list until the capturing is stopped. The resulting 
array will have size (number of frames, height, width, 3). The printout confirms this: 

(480, 640, 3) 

(40, 480, 640, 3) 

In this case, there were 40 frames recorded. Arrays with video data like this are useful 
for video processing, such as in computing frame differences and tracking. 
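
As a small illustration of what such an array makes possible, the sketch below computes frame differences on a synthetic stack of frames (no camera needed). The array contents and the threshold of 50 are arbitrary values chosen for the example:

```python
import numpy as np

# synthetic video: 3 frames of size 4 x 4 with 3 color channels
frames = np.zeros((3, 4, 4, 3), dtype=np.uint8)
frames[1, 1, 1] = 200      # a bright pixel appears in frame 1
frames[2, 1, 2] = 200      # and has moved one pixel to the right in frame 2

# convert to a signed type first; uint8 subtraction would wrap around
diffs = np.abs(np.diff(frames.astype(np.int16), axis=0))

# a pixel counts as moving if any channel changed by more than the threshold
motion = diffs.max(axis=3) > 50
```

The result has one entry per consecutive pair of frames, so a video with n frames gives n-1 motion masks.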

10.4 Tracking

Tracking is the process of following objects through a sequence of images or video.

Optical Flow

Optical flow (sometimes called optic flow) is the image motion of objects as the objects,
scene, or camera move between two consecutive images. It is a 2D vector field of
within-image translation. It is a classic and well-studied field in computer vision with many
successful applications in, for example, video compression, motion estimation, object
tracking, and image segmentation.

Optical flow relies on three major assumptions: 

1. Brightness constancy: The pixel intensities of an object in an image do not change 
between consecutive images. 

2. Temporal regularity: The between-frame time is short enough to consider the motion
change between images using differentials (used to derive the central equation
below).

3. Spatial consistency: Neighboring pixels have similar motion. 

In many cases, these assumptions break down, but for small motions and short time
steps between images it is a good model. Assuming that an object pixel $I(x, y, t)$
at time $t$ has the same intensity at time $t + \delta t$ after motion $[\delta x, \delta y]$ means that
$I(x, y, t) = I(x + \delta x, y + \delta y, t + \delta t)$. Differentiating this constraint gives the optical
flow equation:

$$\nabla I^T \mathbf{v} = -I_t,$$

where $\mathbf{v} = [u, v]$ is the motion vector and $I_t$ the time derivative. For individual points
in the image, this equation is under-determined and cannot be solved (one equation
with two unknowns in $\mathbf{v}$). By enforcing some spatial consistency, it is possible to
obtain solutions, though. In the Lucas-Kanade algorithm below, we will see how that
assumption is used.
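
The differentiation step can be spelled out with a first-order Taylor expansion. The following is a short sketch of the standard argument, using $I_x$, $I_y$, and $I_t$ for the partial derivatives of the image intensity and assuming the displacement is small enough for the higher-order terms to be dropped:

```latex
\begin{align*}
I(x + \delta x,\, y + \delta y,\, t + \delta t)
  &\approx I(x, y, t) + I_x\,\delta x + I_y\,\delta y + I_t\,\delta t
  && \text{(first-order expansion)} \\
I_x\,\delta x + I_y\,\delta y + I_t\,\delta t &= 0
  && \text{(brightness constancy)} \\
I_x\,u + I_y\,v &= -I_t
  && \text{(divide by } \delta t,\ u = \delta x / \delta t,\ v = \delta y / \delta t\text{)} \\
\nabla I^T \mathbf{v} &= -I_t
  && \text{(with } \nabla I = [I_x, I_y]^T,\ \mathbf{v} = [u, v]^T\text{)}
\end{align*}
```

This makes the under-determination visible: each pixel contributes one linear equation in the two unknowns $u$ and $v$.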

OpenCV contains several optical flow implementations: CalcOpticalFlowBM(), which
uses block matching; CalcOpticalFlowHS(), which uses [15] (both of these currently
only in the old cv module); the pyramidal Lucas-Kanade algorithm [19]
calcOpticalFlowPyrLK(); and finally calcOpticalFlowFarneback() based on [10]. The
last one is considered one of the best methods for obtaining dense flow fields. Let's
look at an example of using this to find motion vectors in video (the Lucas-Kanade
version is the subject of the next section).

Try running the following script:

import cv2
from numpy import *

def draw_flow(im,flow,step=16):
    """ Plot optical flow at sample points
        spaced step pixels apart. """

    h,w = im.shape[:2]
    y,x = mgrid[step//2:h:step,step//2:w:step].reshape(2,-1)
    fx,fy = flow[y,x].T

    # create line endpoints
    lines = vstack([x,y,x+fx,y+fy]).T.reshape(-1,2,2)
    lines = int32(lines)

    # create image and draw
    vis = cv2.cvtColor(im,cv2.COLOR_GRAY2BGR)
    for (x1,y1),(x2,y2) in lines:
        cv2.line(vis,(x1,y1),(x2,y2),(0,255,0),1)
        cv2.circle(vis,(x1,y1),1,(0,255,0),-1)
    return vis

# setup video capture
cap = cv2.VideoCapture(0)

ret,im = cap.read()
prev_gray = cv2.cvtColor(im,cv2.COLOR_BGR2GRAY)

while True:
    # get grayscale image
    ret,im = cap.read()
    gray = cv2.cvtColor(im,cv2.COLOR_BGR2GRAY)

    # compute flow
    flow = cv2.calcOpticalFlowFarneback(prev_gray,gray,None,0.5,3,15,3,5,1.2,0)
    prev_gray = gray

    # plot the flow vectors
    cv2.imshow('Optical flow',draw_flow(gray,flow))
    if cv2.waitKey(10) == 27:
        break

This example will capture images from a webcam and call the optical flow estimation on
every consecutive pair of images. The motion flow vectors are stored in the two-channel
image flow returned by calcOpticalFlowFarneback(). Besides the previous frame and
the current frame, this function takes a sequence of parameters. Look them up in the
documentation if you are interested. The helper function draw_flow() plots the motion
vectors at evenly spaced points in the image. It uses the OpenCV drawing functions
line() and circle(), and the variable step controls the spacing of the flow samples.
The result can look like the screenshots in Figure 10-5. Here the positions of the flow
samples are shown as a grid of circles and the flow vectors with lines show how each
sample point moves.

The Lucas-Kanade Algorithm 

The most basic form of tracking is to follow interest points such as corners. A popular
algorithm for this is the Lucas-Kanade tracking algorithm, which uses a sparse optical
flow algorithm.

Figure 10-5. Optical flow vectors (sampled at every 16th pixel) shown on video of a translating book
and a turning head.

Lucas-Kanade tracking can be applied to any type of features, but usually makes
use of corner points similar to the Harris corner points in Section 2.1. The function
goodFeaturesToTrack() detects corners according to an algorithm by Shi and Tomasi
[33], where corners are points with two large eigenvalues of the structure tensor (Harris
matrix) equation (2.2) and where the smaller eigenvalue is above a threshold.
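
The eigenvalue criterion can be illustrated on two tiny patches. This is a rough sketch
with NumPy only, assuming an unweighted sum over the patch instead of the Gaussian
weighting; the function name min_eigenvalue() is made up for this example and this is not
how goodFeaturesToTrack() is implemented internally.

```python
import numpy as np

def min_eigenvalue(patch):
    """Smaller eigenvalue of the (unweighted) structure tensor of a patch."""
    Iy, Ix = np.gradient(patch.astype(float))   # derivatives along rows and columns
    M = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    return np.linalg.eigvalsh(M)[0]              # eigvalsh returns ascending order

# an edge: intensity changes in the x direction only
edge = np.tile(np.arange(7.0), (7, 1))

# a corner: one bright quadrant
corner = np.zeros((7, 7))
corner[3:, 3:] = 1.0

print(min_eigenvalue(edge))     # close to zero, rejected as a feature
print(min_eigenvalue(corner))   # clearly positive, accepted above a threshold
```

The edge patch has gradients in only one direction, so its smaller eigenvalue vanishes,
while the corner patch has gradients in both directions.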


The optical flow equation is under-determined (meaning that there are too many
unknowns per equation) if considered on a per-pixel basis. Using the assumption that
neighboring pixels have the same motion, it is possible to stack many of these equations
into one system of equations like this

$$
\begin{bmatrix} \nabla I^T(\mathbf{x}_1) \\ \nabla I^T(\mathbf{x}_2) \\ \vdots \\ \nabla I^T(\mathbf{x}_n) \end{bmatrix} \mathbf{v}
=
\begin{bmatrix} I_x(\mathbf{x}_1) & I_y(\mathbf{x}_1) \\ I_x(\mathbf{x}_2) & I_y(\mathbf{x}_2) \\ \vdots & \vdots \\ I_x(\mathbf{x}_n) & I_y(\mathbf{x}_n) \end{bmatrix}
\begin{bmatrix} u \\ v \end{bmatrix}
= -
\begin{bmatrix} I_t(\mathbf{x}_1) \\ I_t(\mathbf{x}_2) \\ \vdots \\ I_t(\mathbf{x}_n) \end{bmatrix}
$$

for some neighborhood of n pixels. This has the advantage that the system now has
more equations than unknowns and can be solved with least squares methods. Typically
the contribution from the surrounding pixels is weighted so that pixels farther away
have less influence. A Gaussian weighting is the most common choice. This turns the
matrix above into the structure tensor in equation (2.2), and we have the relation

$$
\overline{M}_I \mathbf{v} = -
\begin{bmatrix} I_t(\mathbf{x}_1) \\ I_t(\mathbf{x}_2) \\ \vdots \\ I_t(\mathbf{x}_n) \end{bmatrix}
\quad \text{or simpler} \quad A\mathbf{v} = \mathbf{b}.
$$

This over-determined equation system can be solved in a least squares sense and the
motion vector is given by

$$
\mathbf{v} = (A^T A)^{-1} A^T \mathbf{b}.
$$

This is solvable only when $A^T A$ is invertible, which it is by construction if applied
at Harris corner points or the “good features to track” of Shi-Tomasi. This is how the
motion vectors are computed in the Lucas-Kanade tracking algorithms.

Standard Lucas-Kanade tracking works for small displacements. To handle larger
displacements, a hierarchical approach is used. In this case, the optical flow is
computed at coarse-to-fine versions of the image. This is what the OpenCV function
calcOpticalFlowPyrLK() does.
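
For a single neighborhood and a single pyramid level, the least squares step can be
sketched in a few lines of NumPy. This is an illustration under simplifying assumptions,
not the OpenCV implementation: the pixels are unweighted, the function name
lucas_kanade_step() is hypothetical, and the temporal derivative is synthesized from a
known motion so that the expected answer is available.

```python
import numpy as np

def lucas_kanade_step(Ix, Iy, It):
    """Solve the stacked system A v = b, with rows [Ix, Iy] and b = -It, by least squares."""
    A = np.column_stack([Ix.ravel(), Iy.ravel()])
    b = -It.ravel()
    v, residuals, rank, singular_values = np.linalg.lstsq(A, b, rcond=None)
    return v, rank

# spatial derivatives of a small corner-like patch
patch = np.zeros((7, 7))
patch[3:, 3:] = 1.0
Iy, Ix = np.gradient(patch)

# synthetic temporal derivative that satisfies the optical flow equation
# for a known motion (u, v) = (0.5, -0.25)
true_v = np.array([0.5, -0.25])
It = -(Ix * true_v[0] + Iy * true_v[1])

v, rank = lucas_kanade_step(Ix, Iy, It)
print(v)      # estimated motion vector
print(rank)   # rank 2 means A^T A is invertible for this patch
```

If the patch contained only an edge, the rank would drop to 1 and the motion along the
edge could not be recovered, which is why corner points are used.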

The Lucas-Kanade functions are included in OpenCV. Let's look at how to use those
to build a Python tracker class. Create a file lktrack.py and add the following class and
constructor:

import cv2
from numpy import *  # float32() and array() are used by the methods below

# some constants and default parameters
lk_params = dict(winSize=(15,15),maxLevel=2,
            criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT,10,0.03))

subpix_params = dict(zeroZone=(-1,-1),winSize=(10,10),
            criteria = (cv2.TERM_CRITERIA_COUNT | cv2.TERM_CRITERIA_EPS,20,0.03))

feature_params = dict(maxCorners=500,qualityLevel=0.01,minDistance=10)


class LKTracker(object):
    """ Class for Lucas-Kanade tracking with
        pyramidal optical flow."""

    def __init__(self,imnames):
        """ Initialize with a list of image names. """

        self.imnames = imnames
        self.features = []
        self.tracks = []
        self.current_frame = 0


The tracker object is initialized with a list of filenames. The variables features and tracks
will hold the corner points and their tracked positions. We also use a variable to keep
track of the current frame. We define three dictionaries with parameters for the feature
extraction, the tracking, and the subpixel feature point refinement.

Now, to start detecting points, we need to load the actual image, create a grayscale 
version, and extract the “good features to track” points. The OpenCV function doing 
the main work is goodFeaturesToTrack(). Add this detect_points() method to the class: 


    def detect_points(self):
        """ Detect 'good features to track' (corners) in the current frame
            using sub-pixel accuracy. """

        # load the image and create grayscale
        self.image = cv2.imread(self.imnames[self.current_frame])
        self.gray = cv2.cvtColor(self.image,cv2.COLOR_BGR2GRAY)

        # search for good points
        features = cv2.goodFeaturesToTrack(self.gray, **feature_params)

        # refine the corner locations
        cv2.cornerSubPix(self.gray,features, **subpix_params)

        self.features = features
        self.tracks = [[p] for p in features.reshape((-1,2))]
        self.prev_gray = self.gray

The point locations are refined using cornerSubPix() and stored in the member variables
features and tracks. Note that running this function clears the track history.

Now that we can detect the points, we also need to track them. First, we need to get the 
next frame, apply the OpenCV function calcOpticalFlowPyrLK() that finds out where 
the points moved, and then remove and clean the lists of tracked points. The method 
track_points() below does this: 

    def track_points(self):
        """ Track the detected features. """

        # len() works whether features is a list or a NumPy array
        if len(self.features) > 0:
            self.step() # move to the next frame

            # load the image and create grayscale
            self.image = cv2.imread(self.imnames[self.current_frame])
            self.gray = cv2.cvtColor(self.image,cv2.COLOR_BGR2GRAY)

            # reshape to fit input format
            tmp = float32(self.features).reshape(-1, 1, 2)

            # calculate optical flow
            features,status,track_error = cv2.calcOpticalFlowPyrLK(self.prev_gray,
                                            self.gray,tmp,None,**lk_params)

            # remove points lost
            self.features = [p for (st,p) in zip(status,features) if st]

            # clean tracks from lost points
            features = array(features).reshape((-1,2))
            for i,f in enumerate(features):
                self.tracks[i].append(f)
            ndx = [i for (i,st) in enumerate(status) if not st]
            ndx.reverse() # remove from back
            for i in ndx:
                self.tracks.pop(i)

            self.prev_gray = self.gray
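
The bookkeeping at the end of track_points() can be tried in isolation. The sketch below
uses plain Python lists and tuples in place of the arrays returned by OpenCV; the function
name prune_tracks() and the sample values are assumptions made for this illustration.

```python
def prune_tracks(tracks, new_points, status):
    """Append the new position to every track, then drop tracks whose status is 0."""
    for track, p in zip(tracks, new_points):
        track.append(p)
    # pop from the back so that the indices of the remaining tracks stay valid
    for i in reversed(range(len(status))):
        if not status[i]:
            tracks.pop(i)
    return tracks

tracks = [[(0, 0)], [(5, 5)], [(9, 9)]]
new_points = [(1, 0), (6, 5), (9, 8)]
status = [1, 0, 1]   # the second point was lost

print(prune_tracks(tracks, new_points, status))
```

Removing from the front instead would shift the later entries and make the remaining
indices point at the wrong tracks.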

This makes use of a simple helper method step() that moves to the next available frame: 

    def step(self,framenbr=None):
        """ Step to another frame. If no argument is
            given, step to the next frame. """

        if framenbr is None:
            self.current_frame = (self.current_frame + 1) % len(self.imnames)
        else:
            self.current_frame = framenbr % len(self.imnames)

This method jumps to a given frame or just to the next if no argument is given. Because
of the modulo operation, the frame index wraps around to the start when the end of the
list is reached.

Finally, we also want to be able to draw the result using OpenCV windows and drawing
functions. Add this draw() method to the LKTracker class:

    def draw(self):
        """ Draw the current image with points using
            OpenCV's own drawing functions.
            Press any key to close window."""

        # draw points as green circles
        for point in self.features:
            cv2.circle(self.image,(int(point[0][0]),int(point[0][1])),3,(0,255,0),-1)

        cv2.imshow('LKtrack',self.image)
        cv2.waitKey()

Now we have a complete self-contained tracking system using OpenCV functions. 

Using the tracker 

Let's tie it all together by using this tracker class on a real tracking scenario. The
following script will initialize a tracker object, detect and track points through the
sequence, and draw the result:

import lktrack

imnames = ['bt.003.pgm', 'bt.002.pgm', 'bt.001.pgm', 'bt.000.pgm']

# create tracker object
lkt = lktrack.LKTracker(imnames)

# detect in first frame, track in the remaining
lkt.detect_points()
lkt.draw()
for i in range(len(imnames)-1):
    lkt.track_points()
    lkt.draw()

The drawing is one frame at a time and shows the points currently tracked. Pressing
any key will move to the next image in the sequence. The resulting figure windows for
the first four images of the Oxford corridor sequence (one of the Oxford multi-view
datasets available at http://www.robots.ox.ac.uk/~vgg/data/data-mview.html) look like
Figure 10-6.

Using generators 

Add the following method to the LKTracker class:

    def track(self):
        """ Generator for stepping through a sequence."""

        for i in range(len(self.imnames)):
            # len() works whether features is a list or a NumPy array
            if len(self.features) == 0:
                self.detect_points()
            else:
                self.track_points()

            # create a copy in RGB
            f = array(self.features).reshape(-1,2)
            im = cv2.cvtColor(self.image,cv2.COLOR_BGR2RGB)
            yield im,f

Figure 10-6. Tracking using the Lucas-Kanade algorithm through the LKTrack class.

This creates a generator that makes it easy to step through a sequence and get tracks
and the images as RGB arrays so that it is easy to plot the result. To use it on the classic
Oxford “dinosaur” sequence (from the same multi-view dataset page as the corridor
above) and plot the points and their tracks, the code looks like this:

import lktrack
from pylab import *  # figure(), imshow(), plot(), axis() and show()

imnames = ['viff.000.ppm', 'viff.001.ppm',
           'viff.002.ppm', 'viff.003.ppm', 'viff.004.ppm']

# track using the LKTracker generator
lkt = lktrack.LKTracker(imnames)
for im,ft in lkt.track():
    print 'tracking %d features' % len(ft)

# plot the tracks
figure()
imshow(im)
for p in ft:
    plot(p[0],p[1],'bo')
for t in lkt.tracks:
    plot([p[0] for p in t],[p[1] for p in t])
axis('off')
show()

Figure 10-7. An example of using Lucas-Kanade tracking on a turntable sequence and plotting the
tracks of points.

This generator makes it really easy to use the tracker class and completely hides the 
OpenCV functions from the user. The example generates a plot like the one shown in 
Figure 10-7 and the bottom right of Figure 10-6. 
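
The generator pattern itself does not depend on OpenCV. Here is a toy sketch of the same
control flow, where the class name FrameStepper and the stand-in "detection" and
"tracking" steps are invented purely for illustration:

```python
class FrameStepper(object):
    """Toy stand-in for LKTracker that shows the detect-then-track generator pattern."""

    def __init__(self, names):
        self.names = names
        self.features = []

    def track(self):
        for name in self.names:
            if len(self.features) == 0:
                # stand-in for detect_points(): start with two fixed points
                self.features = [(0, 0), (1, 1)]
            else:
                # stand-in for track_points(): every point moves one pixel right
                self.features = [(x + 1, y) for (x, y) in self.features]
            # hand the current state to the caller and pause until the next request
            yield name, list(self.features)

for name, points in FrameStepper(['a', 'b', 'c']).track():
    print(name, points)
```

The caller only sees a sequence of (frame, points) pairs; the decision between detecting
and tracking stays hidden inside the generator.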

10.5 More Examples 

With OpenCV comes a number of useful examples of how to use the Python interface.
These are in the sub-directory samples/python2/ and are a good way to get familiar
with OpenCV. Here are a few selected examples to illustrate some other capabilities of
OpenCV.

Inpainting 

The reconstruction of lost or deteriorated parts of images is called inpainting. This
covers both algorithms to recover lost or corrupted parts of image data for restoration
purposes as well as removing red-eyes or objects in photo-editing applications.
Typically, a region of the image is marked as “corrupt” and needs to be filled using the
data from the rest of the image.

Figure 10-8. An example of inpainting with OpenCV. The left image shows areas marked by a user as
“corrupt.” The right image shows the result after inpainting.

Try the following command: 

$ python inpaint.py empire.jpg 

This will open an interactive window where you can draw regions to be inpainted. The 
results are shown in a separate window. An example is shown in Figure 10-8. 

Segmentation with the Watershed Transform

Watershed is an image processing technique that can be used for segmentation (see 
Figure 10-9). An image is treated as a topological landscape that is “flooded” from a 
number of seed regions. Usually a gradient magnitude image is used since this has 
ridges at strong edges and will make the segmentation stop at image edges. 

The implementation in OpenCV uses an algorithm by Meyer [22]. Try it using the 
following command: 

$ python watershed.py empire.jpg 

This will open an interactive window where you can draw the seed regions you want 
the algorithm to use as input. The results are shown in a second window with colors 
representing regions overlaid on a grayscale version of the input image. 
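
The flooding idea can be sketched in pure Python on a tiny height map. This is a
simplified priority-flood illustration and not the algorithm used by OpenCV; the function
name seeded_watershed() and the one-row "image" are assumptions chosen to keep the example
easy to follow.

```python
import heapq

def seeded_watershed(height, markers):
    """Grow seed labels (markers > 0) over a height map, lowest pixels first."""
    rows, cols = len(height), len(height[0])
    labels = [row[:] for row in markers]
    heap = []
    counter = 0   # insertion order breaks ties between equal heights
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] > 0:
                heapq.heappush(heap, (height[r][c], counter, r, c))
                counter += 1
    while heap:
        _, _, r, c = heapq.heappop(heap)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and labels[nr][nc] == 0:
                labels[nr][nc] = labels[r][c]   # flood the neighbor from this pixel
                heapq.heappush(heap, (height[nr][nc], counter, nr, nc))
                counter += 1
    return labels

# a ridge (value 9) separates two basins, each with one seed
height = [[0, 1, 2, 9, 2, 1, 0]]
markers = [[1, 0, 0, 0, 0, 0, 2]]
labels = seeded_watershed(height, markers)
print(labels)
```

Both basins fill up to the ridge before the ridge pixel itself is reached, so the two
regions meet at the strongest "edge" of the height map.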

Line Detection with a Hough Transform 

The Hough transform (http://en.wikipedia.org/wiki/Hough_transform) is a method for
finding shapes in images. It works by using a voting procedure in the parameter space
of the shapes. The most common use is to find line structures in images. In that case,
edges and line segments can be grouped together by them voting for the same line
parameters in the 2D parameter space of lines.

Figure 10-9. An example of segmenting an image using a watershed transform. The left image is the
input image with seed regions drawn. The right image shows the resulting segmentation with colors
overlaid on the image.

Figure 10-10. An example of detecting lines using a Hough transform. The left image is the source in
grayscale. The right image shows an edge map with detected lines.

The OpenCV sample detects lines using this approach. (This sample is currently in the
/samples/python folder.) Try the following command:

$ python houghlines.py empire.jpg

This gives two windows like the ones shown in Figure 10-10. One window shows the
source image in grayscale, and the other shows the edge map used together with lines
detected as those with most votes in parameter space. Note that the lines are always
infinite; if you want to find the endpoints of line segments in the image, you can use
the edge map to try to find them.
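
The voting procedure can be sketched with NumPy for a handful of edge points. This is a
bare-bones illustration assuming the usual line parameterization
rho = x cos(theta) + y sin(theta) with one-degree and one-pixel bins; the function name
hough_peak() is hypothetical and this is not the OpenCV sample code.

```python
import numpy as np

def hough_peak(points, max_rho):
    """Accumulate votes in (rho, theta) space and return the strongest line."""
    thetas = np.deg2rad(np.arange(180))            # theta = 0, 1, ..., 179 degrees
    cols = np.arange(180)
    acc = np.zeros((2 * max_rho + 1, 180), dtype=int)   # rows cover rho in [-max_rho, max_rho]
    for x, y in points:
        # every point votes once for each theta, at the rho of the line through it
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rhos + max_rho, cols] += 1
    r, t = np.unravel_index(np.argmax(acc), acc.shape)
    return int(r) - max_rho, int(t), int(acc[r, t])

# 50 edge points on the vertical line x = 20
points = [(20, y) for y in range(0, 100, 2)]
rho, theta_deg, votes = hough_peak(points, max_rho=150)
print(rho, theta_deg, votes)
```

All points agree on a single cell of the accumulator (rho = 20, theta = 0), which is why
that cell collects the most votes.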


Exercises 

1. Use optical flow to build a simple gesture recognition system. For example, you 
could sample the flow as in the plotting function and use these sample vectors as 
input. 

2. There are two warp functions available in OpenCV, cv2.warpAffine() and 
cv2.warpPerspective(). Try to use them on some of the examples from Chapter 3. 

3. Use the flood fill function to do background subtraction on the Oxford “dinosaur”
images used in Figure 10-7. Create new images with the dinosaur placed on a 
different color background or on a different image. 

4. OpenCV has a function cv2.findChessboardCorners(), which automatically finds 
the corners of a chessboard pattern. Use this function to get correspondences for 
calibrating a camera with the function cv2.calibrateCamera(). 

5. If you have two cameras, mount them in a stereo rig setting and capture stereo 
image pairs using cv2.VideoCapture() with different video device ids. Try 0 and 1
for starters. Compute depth maps for some varying scenes. 

6. Use Hu moments with cv2.HuMoments() as features for the Sudoku OCR
classification problem in Section 8.4 and check the performance.

7. OpenCV has an implementation of the Grab Cut segmentation algorithm. Use
the function cv2.grabCut() on the Microsoft Research Grab Cut dataset (see
Section 9.1). Hopefully you will get better results than the low-resolution segmentation
in our examples.

8. Modify the Lucas-Kanade tracker class to take a video file as input and write a
script that tracks points between frames and detects new points every k frames.

APPENDIX A


Installing Packages 


Here are short installation instructions for the packages used in the book. They are
written based on the latest versions as of the writing of this book. Things change (URLs
change!), so if the instructions become outdated, check the individual project websites
for help.

In addition to the specific instructions, an option that often works on most platforms is
Python's easy_install. If you run into problems with the installation instructions given
here, easy_install is worth a try. Find out more on the package website,
http://packages.python.org/distribute/easy_install.html.
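
After installing, a quick way to see which packages Python can actually find is to try
importing them. The following is a small sketch using only the standard library; the
function name check_packages() and the list of module names are assumptions for
illustration, so adjust the list to the packages you need.

```python
import importlib

def check_packages(names):
    """Map each module name to its version string, 'installed', or None if missing."""
    status = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            status[name] = getattr(module, '__version__', 'installed')
        except ImportError:
            status[name] = None
    return status

# module names as they are imported, which can differ from the package names
for name, version in sorted(check_packages(['numpy', 'scipy', 'matplotlib', 'cv2']).items()):
    print(name, version if version is not None else 'NOT FOUND')
```

A module reported as missing either is not installed or lives in a directory that is not
on the Python path.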

A.1 NumPy and SciPy

Installing NumPy and SciPy is a little different depending on your operating system. 
Follow the applicable instructions below. The current versions are 2.0 (NumPy) and 0.11 
(SciPy) on most platforms. A package that currently works on all major platforms is 
the Enthought EPD Free bundle, a free light version of the commercial Enthought 
distribution, available at http://enthought.com/products/epd_free.php. 

Windows 

The easiest way to install NumPy and SciPy is to download and install the binary
distributions from http://www.scipy.org/Download.

Mac OS X 

Later versions of Mac OS X (10.7.0 [Lion] and up) come with NumPy pre-installed. 

An easy way to install NumPy and SciPy for Mac OS X is with the “superpack” from 
https://github.com/fonnesbeck/ScipySuperpack. This also gives you Matplotlib. 

Another alternative is to use the package system MacPorts (http://www.macports.org/).
This also works for Matplotlib instead of the instructions below.

If none of those work, the project web page has other alternatives listed
(http://scipy.org/).

Linux

Installation requires that you have administrator rights on your computer. On some 
distributions NumPy comes pre-installed, on others not. Both NumPy and SciPy are most 
easily installed with the built-in package handler (for example Synaptic on Ubuntu). 
You can also use the package handler for Matplotlib instead of the instructions below. 

A.2 Matplotlib 

Here are instructions for installing Matplotlib in case your NumPy/SciPy installation did
not also install Matplotlib. Matplotlib is freely available at
http://matplotlib.sourceforge.net/. Click the “download” link and download the installer
for the latest version for your system and Python version. Currently the latest version
is 1.1.0.

Alternatively, just download the source and unpack. Run

$ python setup.py install

from the command line and everything should work. General tips on installing for 
different systems can be found at http://matplotlib.sourceforge.net/users/installing.html, 
but the process above should work for most platforms and Python versions. 


A.3 PIL 

PIL, the Python Imaging Library, is available at http://www.pythonware.com/products/pil/.
The latest free version is 1.1.7. Download the source kit and unpack the folder. In
the downloaded folder, run

$ python setup.py install

from the command line. 

You need to have JPEG (libjpeg) and PNG (zlib) supported if you want to save images 
using PIL. See the README file or the PIL website if you encounter any problems. 


A.4 LibSVM 

The current release is version 3.1 (released April 2011). Download the zip file from the
LibSVM website (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). Unzip the file (a directory
“libsvm-3.1” will be created). In a terminal window, go to this directory and type “make”:

$ cd libsvm-3.1
$ make

Then go to the “python” directory and do the same:

$ cd python/
$ make

This should be all you need to do. To test your installation, start Python from the
command line and try:

import svm

The authors wrote a practical guide for using LibSVM [7]. This is a good starting point.


A.5 OpenCV 

Installing OpenCV is a bit different, depending on your operating system. Follow the 
applicable instructions below. 

To check your installation, start Python and try the cookbook examples at
http://opencv.willowgarage.com/documentation/python/cookbook.html. The online OpenCV
Python reference guide gives more examples and details on how to use OpenCV with Python
at http://opencv.willowgarage.com/documentation/python/index.html.


Windows and Unix 

There are installers for Windows and Unix available at the SourceForge repository 
http://sourceforge.net/projects/opencvlibrary/. 


Mac OS X 

Mac OS X support has been lacking but is on the rise. There are several ways to install
from source as described on the OpenCV wiki,
http://opencv.willowgarage.com/wiki/InstallGuide. MacPorts is one option that works well
if you are using Python, NumPy, SciPy, or Matplotlib, also from MacPorts. Building OpenCV
from source can be done like this:

$ svn co https://code.ros.org/svn/opencv/trunk/opencv
$ cd opencv/
$ sudo cmake -G "Unix Makefiles" .
$ sudo make -j8
$ sudo make install

If you have all the dependencies in place, everything should build and install properly.
If you get an error like

>>> import cv2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named cv2

then you need to add the directory containing cv2.so to PYTHONPATH. For example:

$ export PYTHONPATH=/usr/local/lib/python2.7/site-packages/

Linux

Linux users could try the package installer for the distribution (the package is usually 
called “opencv”) or install from source as described in the Mac OS X section. 

A.6 VLFeat 

To install VLFeat, download and unpack the latest binary package from
http://vlfeat.org/download.html (currently the latest version is 0.9.14). Add the paths to
your environment or copy the binaries to a directory in your path. The binaries are in
the bin/ directory; just pick the sub-directory for your platform.

The use of the VLFeat command line binaries is described in the src/ sub-directory.
Alternatively, you can find the documentation online at http://vlfeat.org/man/man.html.

A.7 PyGame 

PyGame can be downloaded from http://www.pygame.org/download.shtml. The latest 
version is 1.9.1. The easiest way is to get the binary install package for your system and 
Python version. 

Alternatively, you can download the source, and in the downloaded folder run

$ python setup.py install

from the command line.

A.8 PyOpenGL 

Installing PyOpenGL is easiest done by downloading the package from
http://pypi.python.org/pypi/PyOpenGL as suggested on the PyOpenGL web page,
http://pyopengl.sourceforge.net/. Get the latest version, currently 3.0.1.

In the downloaded folder, do the usual

$ python setup.py install

from the command line. If you get stuck or need information on dependencies, etc.,
more documentation can be found at
http://pyopengl.sourceforge.net/documentation/installation.html. Some good demo scripts
for getting started are available at http://pypi.python.org/pypi/PyOpenGL-Demo.

A.9 Pydot 

Begin by installing the dependencies, GraphViz and Pyparsing. Go to
http://www.graphviz.org/ and download the latest GraphViz binary for your platform. The
install files should install GraphViz automatically.

Then go to the Pyparsing project page http://pyparsing.wikispaces.com/. The download
page is at http://sourceforge.net/projects/pyparsing/. Get the latest version (currently
1.5.5) and unzip the file to a directory. Type

$ python setup.py install

from the command line.

Finally, go to the project page http://code.google.com/p/pydot/ and click “download”.
From the download page, download the latest version (currently 1.0.4). Unzip and again
type

$ python setup.py install

from the command line. Now you should be able to import pydot in your Python
sessions.


A.10 Python-graph 

Python-graph is a Python module for working with graphs and contains lots of useful
algorithms like traversals, shortest path, pagerank, and maximum flow. The latest
version is 1.8.1 and can be found on the project website
http://code.google.com/p/python-graph/. If you have easy_install on your system, the
simplest way to get python-graph is:

$ easy_install python-graph-core

Alternatively, download the source code from
http://code.google.com/p/python-graph/downloads/list and run:

$ python setup.py install

To write and visualize the graphs (using the DOT language) you need python-graph-dot,
which comes with the download or through easy_install:

$ easy_install python-graph-dot

Python-graph-dot depends on pydot; see above. The documentation (in html) is in the
“docs/” folder.

A.11 Simplejson 

Simplejson is the independently maintained version of the JSON module that comes 
with later versions of Python (2.6 or later). The syntax is the same for both modules, 
but simplejson is more optimized and will give better performance. 

To install, go to the project page https://github.com/simplejson/simplejson and click the
Download button. Then select the latest version from the “Download Packages” section
(currently this is 2.1.3). Unzip the folder and type

$ python setup.py install

from the command line. This should be all you need.

A.12 PySQLite

PySQLite is an SQLite binding for Python. SQLite is a lightweight, disk-based database
that can be queried with SQL and is easy to install and use. The latest version is 2.6.3;
see the project website, http://code.google.com/p/pysqlite/, for more details.

To install, download from http://code.google.com/p/pysqlite/downloads/list and unzip to
a folder. Run

$ python setup.py install

from the command line.

A.13 CherryPy 

CherryPy (http://www.cherrypy.org/) is a fast, stable, and lightweight web server built 
on Python using an object-oriented model. CherryPy is easy to install; just download 
the latest version from http://www.cherrypy.org/wiki/CherryPyInstall. The latest stable 
release is 3.2.0. Unpack and run 

$ python setup.py install

from the command line. After installing, look at the tiny tutorial examples that come
with CherryPy in the cherrypy/tutorial/ folder. These examples show you how to pass
GET/POST variables, inheritance of page properties, file upload and download, etc.

APPENDIX B


Image Datasets 


B.1 Flickr 

The immensely popular photo-sharing site Flickr (http://flickr.com/) is a gold mine for
computer vision researchers and hobbyists. With hundreds of millions of images, many
of them tagged by users, it is a great resource to get training data or for doing
experiments on real data. Flickr has an API for interfacing with the service that makes it
possible to upload, download, and annotate images (and much more). A full description
of the API is available at http://flickr.com/services/api/, and there are kits for many
programming languages, including Python.

Let's look at using a library called flickrpy available freely at
http://code.google.com/p/flickrpy/. Download the file flickr.py. You will need an API Key
from Flickr to get this to work. Keys are free for non-commercial use and can be requested
for commercial use. Just click the link “Apply for a new API Key” on the Flickr API page
and follow the instructions. Once you have an API key, open flickr.py and replace the
empty string on the line

API_KEY = ''

with your key. It should look something like this:

API_KEY = '123fbbb81441231123cgg5b123d92123'

Let's create a simple command line tool that downloads images tagged with a particular
tag. Add the following code to a new file called tagdownload.py.

import flickr
import urllib, urlparse
import os
import sys

if len(sys.argv)>1:
    tag = sys.argv[1]
else:
    print 'no tag specified'
    sys.exit(1) # nothing to search for without a tag

# downloading image data
f = flickr.photos_search(tags=tag)
urllist = [] #store a list of what was downloaded

# downloading images
for k in f:
    url = k.getURL(size='Medium', urlType='source')
    urllist.append(url)
    image = urllib.URLopener()
    image.retrieve(url, os.path.basename(urlparse.urlparse(url).path))
    print 'downloading:', url

If you also want to write the list of URLs to a text file, add the following lines at the end:

# write the list of urls to file
fl = open('urllist.txt', 'w')
for url in urllist:
    fl.write(url+'\n')
fl.close()

From the command line, just type 

$ python tagdownload.py goldengatebridge 

and you will get the 100 latest images tagged with “goldengatebridge”. As you can see,
we chose to take the “Medium” size. If you want thumbnails or full-size originals or
something else, there are many sizes available; check the documentation on the Flickr
website, http://flickr.com/api/.

Here we were just interested in downloading images; for API calls that require
authentication the process is slightly more complicated. See the API documentation for
more information on how to set up authenticated sessions.
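
The download script names each local file after the last part of the image URL. That step
can be tried on its own with the standard library. The sketch below is written so that it
also runs on Python 3, where urlparse lives in urllib.parse; the function name
filename_from_url() and the example URL are made up for illustration.

```python
import os

try:
    from urllib.parse import urlparse   # Python 3
except ImportError:
    from urlparse import urlparse       # Python 2

def filename_from_url(url):
    """Return the last path component of a URL, ignoring any query string."""
    return os.path.basename(urlparse(url).path)

name = filename_from_url('http://example.com/photos/1234_abcd_m.jpg?download=1')
print(name)
```

Parsing the URL first keeps query parameters out of the file name, which a plain split on
"/" would not do.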


B.2 Panoramio 

A good source of geotagged images is Google's photo-sharing service Panoramio
(http://www.panoramio.com/). This web service has an API to access content
programmatically. The API is described at http://www.panoramio.com/api/. You can get
website widgets and access the data using JavaScript objects. To download images, the
simplest way is to use a GET call. For example:

http://www.panoramio.com/map/get_panoramas.php?order=popularity&set=public&from=0&to=20&minx=-180&miny=-90&maxx=180&maxy=90&size=medium

where minx, miny, maxx, maxy define the geographic area to select photos from
(minimum longitude, latitude, maximum longitude and latitude, respectively). The
response will be in JSON and look like this:

{"count": 3152, "photos":
[{"upload_date": "02 February 2006", "owner_name": "***", "photo_id": 9439,
"longitude": -151.75, "height": 375, "width": 500, "photo_title":
"latitude": -16.5, "owner_url": "http://www.panoramio.com/user/1600", "owner_id": 1600,
"photo_file_url": "http://mw2.google.com/mw-panoramio/photos/medium/9439.jpg",
"photo_url": "http://www.panoramio.com/photo/9439"},
{"upload_date": "18 January 2011", "owner_name": "***", "photo_id": 46752123,
"longitude": 120.52718600000003, "height": 370, "width": 500, "photo_title":
"latitude": 23.327833999999999, "owner_url": "http://www.panoramio.com/user/2780232",
"owner_id": 2780232,
"photo_file_url": "http://mw2.google.com/mw-panoramio/photos/medium/46752123.jpg",
"photo_url": "http://www.panoramio.com/photo/46752123"},
{"upload_date": "20 January 2011", "owner_name": "***", "photo_id": 46817885,
"longitude": -178.13709299999999, "height": 330, "width": 500, "photo_title":
"latitude": -14.310613, "owner_url": "http://www.panoramio.com/user/919358",
"owner_id": 919358,
"photo_file_url": "http://mw2.google.com/mw-panoramio/photos/medium/46817885.jpg",
"photo_url": "http://www.panoramio.com/photo/46817885"},
...
], "has_more": true}

Using a JSON package, you can get the “photo_file_url” field of the result. See
Section 2.3 for an example.
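
Building the query string and picking out the image URLs can be sketched with the standard
library alone. The helper names panoramio_query() and photo_urls() are hypothetical, and
the response string below is a shortened, hand-written stand-in for a real response, so no
network access is involved.

```python
import json

try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

def panoramio_query(minx, miny, maxx, maxy, start=0, stop=20, size='medium'):
    """Build the GET URL for a bounding box given in longitude/latitude."""
    params = [('order', 'popularity'), ('set', 'public'),
              ('from', start), ('to', stop),
              ('minx', minx), ('miny', miny), ('maxx', maxx), ('maxy', maxy),
              ('size', size)]
    return 'http://www.panoramio.com/map/get_panoramas.php?' + urlencode(params)

def photo_urls(response_text):
    """Extract the photo_file_url field of every photo in a JSON response."""
    return [photo['photo_file_url'] for photo in json.loads(response_text)['photos']]

url = panoramio_query(-180, -90, 180, 90)

# shortened stand-in for a response, with only the fields used here
sample = ('{"count": 1, "photos": [{"photo_id": 9439, "photo_file_url": '
          '"http://mw2.google.com/mw-panoramio/photos/medium/9439.jpg"}], '
          '"has_more": false}')
urls = photo_urls(sample)
print(url)
print(urls)
```

Keeping the parameters in a list of pairs preserves their order, so the generated URL
matches the layout of the example call.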


B.3 Oxford Visual Geometry Group 

The Visual Geometry research group at Oxford University has many datasets available
at http://www.robots.ox.ac.uk/~vgg/data/. We used some of the multi-view datasets
in this book, for example the “Merton1”, “Model House”, “dinosaur”, and “corridor”
sequences. The data is available for download (some with camera matrices and point
tracks) at http://www.robots.ox.ac.uk/~vgg/data/data-mview.html.


B.4 University of Kentucky Recognition Benchmark Images 

The UK Benchmark image set, also called the “ukbench” set, is a set with 2,550 groups
of images. Each group has four images of an object or scene from varying viewpoints.
This is a good set to test object recognition and image retrieval algorithms. The data
set is available for download (the full set is around 1.5 GB) at
http://www.vis.uky.edu/~stewe/ukbench/. It is described in detail in the paper [23].

In this book, we used a smaller subset using only the first 1,000 images. 


B.5 Other 

Prague Texture Segmentation Datagenerator and Benchmark 

This set used in the segmentation chapter can generate many different types of texture
segmentation images. Available at http://mosaic.utia.cas.cz/index.php.

MSR Cambridge Grab Cut Dataset

Originally used in the Grab Cut paper [27], this set provides segmentation images
with user annotations. The data set and some papers are available from
http://research.microsoft.com/en-us/um/cambridge/projects/visionimagevideoediting/segmentation/grabcut.htm.
The original images in the data set are from a data set that now is part of
the Berkeley Segmentation Dataset,
http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/segbench/.

Caltech 101 

This is a classic dataset that contains pictures of objects from 101 different categories
and can be used to test object recognition algorithms. The data set is available at
http://www.vision.caltech.edu/Image_Datasets/Caltech101/.

Static Hand Posture Database 

This dataset from Sebastien Marcel is available at http://www.idiap.ch/resource/gestures/ 
together with a few other sets with hands and gestures. 

Middlebury Stereo Datasets

These are datasets used to benchmark stereo algorithms. They are available for
download at http://vision.middlebury.edu/stereo/data/. Every stereo pair comes with ground
truth depth images to compare results against.

APPENDIX C


Image Credits 


Throughout this book we have made use of publicly available datasets and images 
available from web services; these were listed in Appendix B. The contributions of the 
researchers behind these datasets are greatly appreciated. 

Some of the recurring example images are the author's own. You are free to use 
these images under a Creative Commons Attribution 3.0 (CC BY 3.0) license
(http://creativecommons.org/licenses/by/3.0/), for example by citing this book. 

These images are: 

• The Empire State building image used in almost every example throughout the 
book. 

• The low contrast image in Figure 1-7. 

• The feature matching examples used in Figures 2-2, 2-5, 2-6, and 2-7. 

• The Fisherman's Wharf sign used in Figures 9-6, 10-1, and 10-2. 

• The little boy on top of a hill used in Figures 6-4 and 9-6. 

• The book image for calibration used in Figure 4-3. 

• The two images of the O’Reilly open source book used in Figures 4-4, 4-5, and 4-6. 

C.1 Images from Flickr 

We used some images from Flickr available with a Creative Commons Attribution 2.0 
Generic (CC BY 2.0) license (http://creativecommons.org/licenses/by/2.0/deed.en). The 
contributions from these photographers are greatly appreciated. 

The images used from Flickr are (names are the ones used in the examples, not the 
original filenames): 

• billboard_for_rent.jpg by @striatic, http://flickr.com/photos/striatic/21671910/, used 
in Figure 3-2. 

• blank_billboard.jpg by @mediaboytodd,
http://flickr.com/photos/23883605@N06/2317982570/, used in Figure 3-3. 

• beatles.jpg by @oddsock, http://flickr.com/photos/oddsock/82535061/, used in
Figures 3-2 and 3-3. 

• turningtorso1.jpg by @rutgerblom,
http://www.flickr.com/photos/rutgerblom/2873185336/, used in Figure 3-5. 

• sunset_tree.jpg by @jpck, http://www.flickr.com/photos/jpck/3344929385/, used in 
Figure 3-5. 

C.2 Other Images 

• The face images used in Figures 3-6, 3-7, and 3-8 are courtesy of J. K. Keller. The 
eye and mouth annotations are the author's. 

• The Lund University building images used in Figures 3-9, 3-11, and 3-12 are from a 
dataset used at the Mathematical Imaging Group, Lund University. The photographer 
was probably Magnus Oskarsson. 

• The toy plane 3D model used in Figure 4-6 is from Gilles Tran (Creative Commons 
License By Attribution). 

• The Alcatraz images in Figures 5-7 and 5-8 are courtesy of Carl Olsson. 

• The font data set used in Figures 1-8, 6-2, 6-3, 6-7, and 6-8 is courtesy of Martin 
Solli. 

• The Sudoku images in Figures 8-6, 8-7, and 8-8 are courtesy of Martin Byrod. 

C.3 Illustrations 

The epipolar geometry illustration in Figure 5-1 is based on an illustration by Klas 
Josephson and adapted for this book. 

gesture recognition, 172 
GL_MODELVIEW, 90 
GL_PROJECTION, 90 
Grab Cut dataset, 197 
gradient angle, 18 
gradient magnitude, 18 
graph, 191 
graph cut, 191 
GraphViz, 48 
graylevel transforms, 8 

H 

Harris corner detection, 29 
Harris matrix, 29 
hierarchical clustering, 133 
hierarchical k-means, 145 
histogram equalization, 10 
Histogram of Oriented Gradients, 170 
HOG, 170 

homogeneous coordinates, 53 
homography, 53 
homography estimation, 54 
Hough transform, 224 

I 

Image, 1 

image contours, 4 
ImageDraw, 130 
image gradient, 18 
image graph, 193 
image histograms, 4 
image patch, 32 
image plane, 79 


image registration, 64 

image retrieval, 147 

image search demo, 162 

image segmentation, 131, 191 

image thumbnails, 2 

inliers, 70 

inpainting, 223 

integral image, 211 

interest point descriptor, 32 

interest points, 29 

inverse depth, 80 

inverse document frequency, 148 

io, 22 

iso-contours, 4 

J 

JSON, 45 

K 

kernel functions, 180 
k-means, 127 
k-nearest neighbor classifier, 167 
kNN, 167 

L 

Laplacian matrix, 142 

least squares triangulation, 108 

LibSVM, 180 

local descriptors, 29 

Lucas-Kanade tracking algorithm, 217 

M 

marking points, 7 
mathematical morphology, 20 
Matplotlib, 3 

maximum flow (max flow), 192 
measurements, 21, 186 
metric reconstruction, 100, 112 
minidom, 65 

minimum cut (min cut), 192 
misc, 22 
morphology, 20 
morphology, 21, 27 
mplot3d, 102, 118 
multi-class SVM, 185 
multi-dimensional arrays, 7 
multi-dimensional histograms, 137 
multiple view geometry, 99 

N 

naive Bayes classifier, 175 
ndimage, 57 
ndimage.filters, 122 
normalized cross-correlation, 33 
normalized cut, 200 
NumPy, 7 

O 

objloader, 96 
OCR, 183 
OpenCV, ix, 209 
OpenGL, 90 

OpenGL projection matrix, 91 

optical axis, 79 

optical center, 81 

optical character recognition, 183 

optical flow, 216 

optical flow equation, 216 

optic flow, 216 

outliers, 70 

overfitting, 190 

P 

panograph, 78 
panorama, 70 
PCA, 13 

pickle, 15, 128, 151 
pickling, 15 

piecewise affine warping, 61 
piecewise constant image model, 205 
PIL, 1 

pin-hole camera, 79 
plane sweeping, 121 
plot formatting, 4 
plotting, 3 

point correspondence, 32 
pose estimation, 86 
Prewitt filters, 18 

Principal Component Analysis, 13, 178 

principal point, 81 

projection, 79 

projection matrix, 79 

projective camera, 79 

projective transformation, 53 

pydot, 48 

pygame, 90 

pygame.image, 90 

pygame.locals, 90 

Pylab, 3 

PyOpenGL, 90 

pyplot, 27 

pysqlite, 152 

pysqlite2, 152 

python-graph, 192 


Python Imaging Library, 1 

Q 

quad, 92 

query with image, 157 
quotient image, 27 

R 

radial basis functions, 180 
ranking using homographies, 160 
RANSAC, 70, 114 
rectified image pair, 120 
rectifying images, 187 
registration, 64 
rigid transformation, 54 
robust homography estimation, 72 
ROF, 23, 205 
RQ-factorization, 83 

Rudin-Osher-Fatemi de-noising model, 23 

S 

Scale-Invariant Feature Transform, 36 
scikit.learn, 190 
SciPy, 16 

scipy.cluster.vq, 128, 130 
scipy.io, 22, 23 
scipy.misc, 23 

scipy.ndimage, 18, 21, 186, 189, 190 
scipy.ndimage.filters, 17, 18, 30 
scipy.sparse, 166 
searching images, 147, 155 
segmentation, 191 
self-calibration, 120 
separating hyperplane, 179 
SfM, 113 
SIFT, 36 

similarity matrix, 140 
similarity transformation, 54 
similarity tree, 133 
simplejson, 45, 46 
single linking, 136 
slicing, 8 
Sobel filters, 18 
spectral clustering, 140, 200 
SQLite, 152 
SSD, 33 

stereo imaging, 120 
stereo reconstruction, 120 
stereo rig, 120 
stereo vision, 120 
stitching images, 75 
stop words, 148 



structure from motion, 113 
structuring element, 21 
Sudoku reader, 183 
sum of squared differences, 33 
Support Vector Machines, 179 
support vectors, 180 
SVM, 179 

T 

term frequency, 148 

term frequency-inverse document frequency, 148 

text mining, 147 

tf-idf weighting, 148 

total variation, 23 

total within-class variance, 127 

tracking, 216 

triangulation, 107 

U 

unpickling, 15 
unsharp masking, 26 
urllib, 46 


V 

variational methods, 205 
variational problems, 205 
vector quantization, 128 
vector space model, 147 
vertical field of view, 91 
video, 213 

visual codebook, 148 
visualizing image distribution, 131 
visual vocabulary, 148 
visual words, 148 
VLFeat, 37 

W 

warping, 57 
watershed, 224 
web applications, 162 
webcam, 217 
word index, 152 

X 

XML, 65 
xml.dom, 65 

Colophon 

The animal on the cover of Programming Computer Vision with Python is a bullhead. 

Often referred to as “bullhead catfish,” members of the genus Ameiurus come in three 
common types: the black bullhead (Ameiurus melas), the yellow bullhead (Ameiurus 
natalis), and the brown bullhead (Ameiurus nebulosus). These stubborn fish prefer 
aging, warm-water lakes and are typically found east of the North American Continental 
Divide. 

Bullheads are known for their obstinacy and tenacity (in fact, people possessing these 
traits are often characterized as “bullheaded”), and these characteristics have helped 
them outlive many other species of fish. They can tolerate brackish water and low 
oxygen and high carbon dioxide levels, making them more resistant to pollutants than 
most other fish and therefore ideal for lab and medical experiments. 

Bullheads are bottom-feeders, prowling in schools at night, on the hunt for clams, 
insects, leeches, small fish, crayfish, and algae. While doing so, they tend to stir up 
the bottom of the body of water and destroy aquatic vegetation, which eliminates 
cover for other species. This, along with their high reproductive rate and subsequent 
overpopulation, can make them somewhat of a curse for fisheries. For this reason, 
restrictions are rarely placed on bullhead fishing. 

All three types of bullhead are scaleless and average 8 to 10 inches in length. The 
whisker-like barbels, or feelers, at the corners of their mouth and in a line on the lower 
chin give them a catfish-like appearance, but they're differentiated partly by the sharp 
spines at the base of their dorsal and pectoral fins. Like catfish, though, their sense of 
smell is more developed than that of canines. 

The cover image is from Wood's Animate Creation. The cover font is Adobe ITC Garamond.
The text font is Linotype Birka; the heading font is Adobe Myriad Condensed; 
and the code font is LucasFonts TheSansMonoCondensed. 